navsection: installguide
title: Install Node Manager
Arvados Node Manager provides elastic computing for Arvados and SLURM by creating and destroying virtual machines on demand. Node Manager currently supports Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure.

Note: Node Manager is only required for elastic computing cloud environments. Fixed-size clusters do not require Node Manager.

Node Manager may run anywhere, but it must be able to communicate with the cloud provider's APIs and use the command line tools @sinfo@, @squeue@ and @scontrol@ to communicate with the cluster's SLURM controller.
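
As a quick pre-flight check on the host where Node Manager will run, the sketch below reports which of the required SLURM tools cannot be found on the @PATH@. The function name @missing_slurm_tools@ is hypothetical and is not part of Node Manager; it only uses @shutil.which@ from the Python standard library.

```python
import shutil

# Command line tools Node Manager needs in order to talk to SLURM.
REQUIRED_TOOLS = ("sinfo", "squeue", "scontrol")


def missing_slurm_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the current PATH.

    The `which` argument exists only so the lookup can be replaced in tests.
    """
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_slurm_tools()
    if missing:
        print("Missing SLURM tools: " + ", ".join(missing))
    else:
        print("All required SLURM tools were found.")
```

An empty result only shows that the tools are installed; it does not confirm that they can reach the SLURM controller.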
On Debian-based systems:

<pre><code>~$ <span class="userinput">sudo apt-get install arvados-node-manager</span>
</code></pre>

On Red Hat-based systems:

<pre><code>~$ <span class="userinput">sudo yum install arvados-node-manager</span>
</code></pre>
h2. Create compute image

Configure a virtual machine following the "instructions to set up a compute node":{{site.baseurl}}/install/crunch2-slurm/install-compute-node.html and set it up to run a "ping script":{{site.baseurl}}/install/install-compute-ping.html at boot.

Create a virtual machine image using the commands provided by your cloud provider. We recommend using a tool such as "Packer":https://www.packer.io/ to automate this process.

Configure Node Manager to use the image with the @image@ or @image_id@ parameter.
h2. Configure node manager

The configuration file is located at @/etc/arvados-node-manager/config.ini@. Some configuration details are specific to the cloud provider you are using:

* "Amazon Web Services":#aws
* "Google Cloud Platform":#gcp
* "Microsoft Azure":#azure
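
To sanity-check a configuration file before starting the daemon, the file can be read with Python's standard @configparser@ module. The following is a minimal sketch, not Node Manager's own loader: the @[Daemon]@ and @[Cloud]@ section names and the @load_settings@ helper are assumptions made for illustration, so check them against your own @config.ini@.

```python
import configparser

# Sample fragment. The [Daemon] and [Cloud] section names are assumptions
# made for this illustration; adjust them to match your own config.ini.
SAMPLE = """
[Daemon]
poll_stale_after = 600
node_mem_scaling = 0.95

[Cloud]
shutdown_windows = 54, 5, 1
"""


def load_settings(text):
    """Parse a config fragment and return a few typed values."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    raw_windows = parser.get("Cloud", "shutdown_windows")
    return {
        "poll_stale_after": parser.getint("Daemon", "poll_stale_after"),
        # Fall back to 0.95 when the option is absent.
        "node_mem_scaling": parser.getfloat(
            "Daemon", "node_mem_scaling", fallback=0.95),
        # Shutdown windows are full minutes separated by commas.
        "shutdown_windows": [int(w) for w in raw_windows.split(",")],
    }


settings = load_settings(SAMPLE)
print(settings)
```

To check a real file, replace @parser.read_string(text)@ with @parser.read("/etc/arvados-node-manager/config.ini")@.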
h3(#aws). Amazon Web Services

# EC2 configuration for Arvados Node Manager.
# All times are in seconds unless specified otherwise.

# The dispatcher can customize the start and stop procedure for
# cloud nodes. For example, the SLURM dispatcher drains nodes
# through SLURM before shutting them down.
# Node Manager will ensure that there are at least this many nodes running at
# all times. If Node Manager needs to start new idle nodes for the purpose of
# satisfying min_nodes, it will use the cheapest node type. However, depending
# on usage patterns, it may also satisfy min_nodes by keeping alive some
# more-expensive nodes.

# Node Manager will not start any compute nodes when at least this
# many are running.

# Upper limit on the rate of spending (in $/hr). Node Manager will not boot
# additional nodes if the total price of already-running nodes meets or
# exceeds this threshold. The default, 0, means no limit.
# Poll EC2 nodes and Arvados for new information every N seconds.

# Polls have exponential backoff when services fail to respond.
# This is the longest time to wait between polls.

# If Node Manager can't successfully poll a service for this long,
# it will not start or stop compute nodes, on the assumption that its
# information is too outdated.
poll_stale_after = 600

# If Node Manager boots a cloud node, and it does not pair with an Arvados
# node before this long, assume that there was a cloud bootstrap failure and
# shut it down. Note that normal shutdown windows apply (see the Cloud
# section), so this should be shorter than the first shutdown window value.
boot_fail_after = 1800
# "Node stale time" affects two related behaviors.
# 1. If a compute node has been running for at least this long, but it
# isn't paired with an Arvados node, do not shut it down, but leave it alone.
# This prevents the node manager from shutting down a node that might
# actually be doing work, but is having temporary trouble contacting the
# API server.
# 2. When the Node Manager starts a new compute node, it will try to reuse
# an Arvados node that hasn't been updated for this long.
node_stale_after = 14400
# Scaling factor to be applied to nodes' available RAM size. Usually there's a
# variable discrepancy between the advertised RAM value on cloud nodes and the
# actual amount available.
# If not set, this value defaults to 0.95.
node_mem_scaling = 0.95

# File path for Certificate Authorities
certs_file = /etc/ssl/certs/ca-certificates.crt

file = /var/log/arvados/node-manager.log
# Log level for most Node Manager messages.
# Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
# WARNING lets you know when polling a service fails.
# INFO additionally lets you know when a compute node is started or stopped.

# You can also set different log levels for specific libraries.
# Pykka is the Node Manager's actor library.
# Setting this to DEBUG will display tracebacks for uncaught
# exceptions in the actors, but it's also very chatty.

# Setting apiclient to INFO will log the URL of every Arvados API request.

host = zyxwv.arvadosapi.com
token = ARVADOS_TOKEN

# Accept an untrusted SSL certificate from the API server?
# It's usually most cost-effective to shut down compute nodes during narrow
# windows of time. For example, EC2 bills each node by the hour, so the best
# time to shut down a node is right before a new hour of uptime starts.
# Shutdown windows define these periods of time. These are windows in
# full minutes, separated by commas. Counting from the time the node is
# booted, the node WILL NOT shut down for N1 minutes; then it MAY shut down
# for N2 minutes; then it WILL NOT shut down for N3 minutes; and so on.
# For example, "54, 5, 1" means the node may shut down from the 54th to the
# 59th minute of each hour of uptime.
# Specify at least two windows. You can add as many as you need beyond that.
shutdown_windows = 54, 5, 1
# This section defines filters that find compute nodes.
# Tags that you specify here will automatically be added to nodes you create.
# Replace colons in Amazon filters with underscores
# (e.g., write "tag:mytag" as "tag_mytag").
instance-state-name = running
tag_arvados-class = dynamic-compute

# New compute nodes will send pings to Arvados at this host.
# You may specify a port, and use brackets to disambiguate IPv6 addresses.
ping_host = hostname:port

# Give the name of an SSH key on AWS...

# ... or a file path for an SSH key that can log in to the compute node.
# (One or the other, not both.)

# The EC2 IDs of the image and subnet compute nodes should use.

# Comma-separated EC2 IDs for the security group(s) assigned to each
# compute node.
security_groups = idstring1, idstring2
# You can define any number of Size sections to list EC2 sizes you're
# willing to use. The Node Manager should boot the cheapest size(s) that
# can run jobs in the queue.

# Each size section MUST define the number of cores available in this
# size class (since libcloud does not provide any consistent API for exposing
# this setting).
# You may also want to define the amount of scratch space (expressed
# in GB) for Crunch jobs. You can also override Amazon's provided
# data fields (such as price per hour) by setting them here.
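
The @shutdown_windows@ rule described in the comments above can be sketched in a few lines of Python. This is an illustration, not Node Manager's implementation: the function name @may_shut_down@ is hypothetical, and the sketch assumes that the list of windows repeats once it is exhausted, as suggested by "each hour of uptime" in the comment.

```python
def may_shut_down(uptime_minutes, windows):
    """Return True if a node with this uptime is inside a MAY-shut-down window.

    `windows` alternates closed and open durations in minutes, starting
    with a closed one. Assumption: the pattern repeats after the last window.
    """
    if len(windows) < 2:
        raise ValueError("specify at least two windows")
    # Position of the node inside the current repetition of the pattern.
    position = uptime_minutes % sum(windows)
    for index, length in enumerate(windows):
        if position < length:
            # Even indexes are WILL NOT windows, odd indexes are MAY windows.
            return index % 2 == 1
        position -= length
    return False  # not reached, because position < sum(windows)


# "54, 5, 1": closed for 54 minutes, open for 5, closed for 1, then repeat.
print(may_shut_down(30, [54, 5, 1]))   # False
print(may_shut_down(56, [54, 5, 1]))   # True
print(may_shut_down(116, [54, 5, 1]))  # True (56 minutes into the second hour)
```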
h3(#gcp). Google Cloud Platform

# Google Compute Engine configuration for Arvados Node Manager.
# All times are in seconds unless specified otherwise.
# Node Manager will ensure that there are at least this many nodes running at
# all times. If Node Manager needs to start new idle nodes for the purpose of
# satisfying min_nodes, it will use the cheapest node type. However, depending
# on usage patterns, it may also satisfy min_nodes by keeping alive some
# more-expensive nodes.

# Node Manager will not start any compute nodes when at least this
# many are running.

# Poll compute nodes and Arvados for new information every N seconds.

# Upper limit on the rate of spending (in $/hr). Node Manager will not boot
# additional nodes if the total price of already-running nodes meets or
# exceeds this threshold. The default, 0, means no limit.
# Polls have exponential backoff when services fail to respond.
# This is the longest time to wait between polls.

# If Node Manager can't successfully poll a service for this long,
# it will not start or stop compute nodes, on the assumption that its
# information is too outdated.
poll_stale_after = 600
# "Node stale time" affects two related behaviors.
# 1. If a compute node has been running for at least this long, but it
# isn't paired with an Arvados node, do not shut it down, but leave it alone.
# This prevents the node manager from shutting down a node that might
# actually be doing work, but is having temporary trouble contacting the
# API server.
# 2. When the Node Manager starts a new compute node, it will try to reuse
# an Arvados node that hasn't been updated for this long.
node_stale_after = 14400
# Scaling factor to be applied to nodes' available RAM size. Usually there's a
# variable discrepancy between the advertised RAM value on cloud nodes and the
# actual amount available.
# If not set, this value defaults to 0.95.
node_mem_scaling = 0.95

# File path for Certificate Authorities
certs_file = /etc/ssl/certs/ca-certificates.crt

file = /var/log/arvados/node-manager.log
# Log level for most Node Manager messages.
# Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
# WARNING lets you know when polling a service fails.
# INFO additionally lets you know when a compute node is started or stopped.

# You can also set different log levels for specific libraries.
# Pykka is the Node Manager's actor library.
# Setting this to DEBUG will display tracebacks for uncaught
# exceptions in the actors, but it's also very chatty.

# Setting apiclient to INFO will log the URL of every Arvados API request.

host = zyxwv.arvadosapi.com
token = ARVADOS_TOKEN

# Accept an untrusted SSL certificate from the API server?
# Shutdown windows define periods of time when a node may and may not
# be shut down. These are windows in full minutes, separated by
# commas. Counting from the time the node is booted, the node WILL
# NOT shut down for N1 minutes; then it MAY shut down for N2 minutes;
# then it WILL NOT shut down for N3 minutes; and so on. For example,
# "54, 5, 1" means the node may shut down from the 54th to the 59th
# minute of each hour of uptime.
# GCE bills by the minute, and does not provide information about when
# a node booted. Node Manager will store this information in metadata
# when it boots a node; if that information is not available, it will
# assume the node booted at the epoch. These shutdown settings are
# very aggressive. You may want to adjust this if you want more
# continuity of service from a single node.
shutdown_windows = 20, 999999
user_id = client_email_address@developer.gserviceaccount.com
key = path_to_certificate.pem
project = project-id-from-google-cloud-dashboard

# Valid location (zone) names: https://cloud.google.com/compute/docs/zones
datacenter = us-central1-a

# Optional settings. For full documentation see
# http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#libcloud.compute.drivers.gce.GCENodeDriver

# auth_type = SA               # SA, IA or GCE
# scopes = https://www.googleapis.com/auth/compute
# A comma-separated list of tags that must be applied to a node for it to
# be considered a compute node.
# The driver will automatically apply these tags to nodes it creates.
tags = zyxwv, compute

# New compute nodes will send pings to Arvados at this host.
# You may specify a port, and use brackets to disambiguate IPv6 addresses.
ping_host = hostname:port

# A file path for an SSH key that can log in to the compute node.

# The GCE image name and network zone name to use when creating new nodes.
# network = your_network_name

# JSON string of service account authorizations for this cluster.
# See http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#specifying-service-account-scopes
# service_accounts = [{'email':'account@example.com', 'scopes':['storage-ro']}]
# You can define any number of Size sections to list node sizes you're
# willing to use. The Node Manager should boot the cheapest size(s) that
# can run jobs in the queue.

# The Size fields are interpreted the same way as with a libcloud NodeSize:
# http://libcloud.readthedocs.org/en/latest/compute/api.html#libcloud.compute.base.NodeSize

# See https://cloud.google.com/compute/docs/machine-types for a list
# of known machine types that may be used as a Size parameter.

# Each size section MUST define the number of cores available in this
# size class (since libcloud does not provide any consistent API for exposing
# this setting).
# You may also want to define the amount of scratch space (expressed
# in GB) for Crunch jobs.
# You can also override Google's provided data fields (such as price per hour)
# by setting them here.
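
The polling comments above describe exponential backoff with an upper bound, plus a staleness cutoff. A minimal sketch of that arithmetic follows. The function and parameter names (@next_poll_delay@, @base_delay@, @max_delay@, @is_stale@) are hypothetical, and the doubling factor is an assumption; Node Manager's actual backoff schedule may differ.

```python
def next_poll_delay(base_delay, max_delay, consecutive_failures):
    """Seconds to wait before the next poll.

    Assumption: the delay doubles after each consecutive failure and is
    capped at max_delay, the longest time to wait between polls.
    """
    if consecutive_failures < 0:
        raise ValueError("consecutive_failures must not be negative")
    return min(max_delay, base_delay * 2 ** consecutive_failures)


def is_stale(seconds_since_last_success, poll_stale_after=600):
    """True once a service has not been polled successfully for too long."""
    return seconds_since_last_success >= poll_stale_after


# With a 60 second base delay capped at 300 seconds:
for failures in range(5):
    print(failures, next_poll_delay(60, 300, failures))
```

With these example values the delay grows from 60 to 120 and 240 seconds and then stays at the 300 second cap. With @poll_stale_after = 600@, a service that has not answered for ten minutes would be treated as stale.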
h3(#azure). Microsoft Azure

# Azure configuration for Arvados Node Manager.
# All times are in seconds unless specified otherwise.

# The dispatcher can customize the start and stop procedure for
# cloud nodes. For example, the SLURM dispatcher drains nodes
# through SLURM before shutting them down.
# Node Manager will ensure that there are at least this many nodes running at
# all times. If Node Manager needs to start new idle nodes for the purpose of
# satisfying min_nodes, it will use the cheapest node type. However, depending
# on usage patterns, it may also satisfy min_nodes by keeping alive some
# more-expensive nodes.

# Node Manager will not start any compute nodes when at least this
# many are running.

# Upper limit on the rate of spending (in $/hr). Node Manager will not boot
# additional nodes if the total price of already-running nodes meets or
# exceeds this threshold. The default, 0, means no limit.
# Poll Azure nodes and Arvados for new information every N seconds.

# Polls have exponential backoff when services fail to respond.
# This is the longest time to wait between polls.

# If Node Manager can't successfully poll a service for this long,
# it will not start or stop compute nodes, on the assumption that its
# information is too outdated.
poll_stale_after = 600

# If Node Manager boots a cloud node, and it does not pair with an Arvados
# node before this long, assume that there was a cloud bootstrap failure and
# shut it down. Note that normal shutdown windows apply (see the Cloud
# section), so this should be shorter than the first shutdown window value.
boot_fail_after = 1800
# "Node stale time" affects two related behaviors.
# 1. If a compute node has been running for at least this long, but it
# isn't paired with an Arvados node, do not shut it down, but leave it alone.
# This prevents the node manager from shutting down a node that might
# actually be doing work, but is having temporary trouble contacting the
# API server.
# 2. When the Node Manager starts a new compute node, it will try to reuse
# an Arvados node that hasn't been updated for this long.
node_stale_after = 14400
# Scaling factor to be applied to nodes' available RAM size. Usually there's a
# variable discrepancy between the advertised RAM value on cloud nodes and the
# actual amount available.
# If not set, this value defaults to 0.95.
node_mem_scaling = 0.95

# File path for Certificate Authorities
certs_file = /etc/ssl/certs/ca-certificates.crt

file = /var/log/arvados/node-manager.log
# Log level for most Node Manager messages.
# Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
# WARNING lets you know when polling a service fails.
# INFO additionally lets you know when a compute node is started or stopped.

# You can also set different log levels for specific libraries.
# Pykka is the Node Manager's actor library.
# Setting this to DEBUG will display tracebacks for uncaught
# exceptions in the actors, but it's also very chatty.

# Setting apiclient to INFO will log the URL of every Arvados API request.

host = zyxwv.arvadosapi.com
token = ARVADOS_TOKEN

# Accept an untrusted SSL certificate from the API server?
# Shutdown windows define periods of time when a node may and may not be shut
# down. These are windows in full minutes, separated by commas. Counting from
# the time the node is booted, the node WILL NOT shut down for N1 minutes; then
# it MAY shut down for N2 minutes; then it WILL NOT shut down for N3 minutes;
# and so on. For example, "20, 999999" means the node may shut down between
# the 20th and 999999th minutes of uptime.
# Azure bills by the minute, so it makes sense to aggressively shut down idle
# nodes. Specify at least two windows. You can add as many as you need beyond
# that.
shutdown_windows = 20, 999999
# Use "azure account list" with the azure CLI to get these values.
tenant_id = 00000000-0000-0000-0000-000000000000
subscription_id = 00000000-0000-0000-0000-000000000000

# The following directions are based on
# https://azure.microsoft.com/en-us/documentation/articles/resource-group-authenticate-service-principal/

# azure config mode arm
# azure ad app create --name "<Your Application Display Name>" --home-page "<https://YourApplicationHomePage>" --identifier-uris "<https://YouApplicationUri>" --password <Your_Password>
# azure ad sp create "<Application_Id>"
# azure role assignment create --objectId "<Object_Id>" -o Owner -c /subscriptions/{subscriptionId}/

# Use <Application_Id> for "key" and the <Your_Password> for "secret"

key = 00000000-0000-0000-0000-000000000000
# The resource group in which the compute node virtual machines will be created
ex_resource_group = ArvadosResourceGroup

# The compute node image, as a link to a VHD in Azure blob store.
image = https://example.blob.core.windows.net/system/Microsoft.Compute/Images/images/zyxwv-compute-osDisk.vhd

# Path to a local ssh key file that will be used to provision new nodes.
ssh_key = /home/arvadosuser/.ssh/id_rsa.pub

# The account name for the admin user that will be provisioned on new nodes.
ex_user_name = arvadosuser

# The Azure storage account that will be used to store the node OS disk images.
ex_storage_account = arvadosstorage

# The virtual network the VMs will be associated with.
ex_network = ArvadosNetwork

# Optional subnet of the virtual network.

tag_arvados-class = dynamic-compute

# The API server to ping
ping_host = hostname:port
# You can define any number of Size sections to list Azure sizes you're willing
# to use. The Node Manager should boot the cheapest size(s) that can run jobs
# in the queue. You must also provide price per hour, as the Azure compute
# driver currently does not report prices.

# See https://azure.microsoft.com/en-us/pricing/details/virtual-machines/
# for a list of known machine types that may be used as a Size parameter.

# Each size section MUST define the number of cores available in this
# size class (since libcloud does not provide any consistent API for exposing
# this setting).
# You may also want to define the amount of scratch space (expressed
# in GB) for Crunch jobs. You can also override Microsoft's provided
# data fields by setting them here.
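
The comments above combine three rules: boot the cheapest size that can run a job, scale advertised RAM by @node_mem_scaling@, and stop booting once the spending cap is reached. The sketch below shows one way these rules could fit together. It is an illustration under stated assumptions, not Node Manager's scheduler: the @Size@ class, the @pick_size@ function, and the size names and prices are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Size:
    name: str      # hypothetical size name
    cores: int
    ram_mb: int    # advertised RAM
    price: float   # price per hour in $, as set in the Size section


def pick_size(sizes: List[Size], cores_needed: int, ram_mb_needed: int,
              running_price: float, max_total_price: float = 0.0,
              mem_scaling: float = 0.95) -> Optional[Size]:
    """Return the cheapest size that fits the job, or None if nothing may boot."""
    # A cap of 0 means no limit; otherwise refuse to boot once the price of
    # already-running nodes meets or exceeds the cap.
    if max_total_price > 0 and running_price >= max_total_price:
        return None
    usable = [
        size for size in sizes
        if size.cores >= cores_needed
        # Only the scaled fraction of advertised RAM counts as available.
        and size.ram_mb * mem_scaling >= ram_mb_needed
    ]
    return min(usable, key=lambda size: size.price, default=None)


# Made-up sizes and prices, for illustration only.
SIZES = [Size("small", 2, 4096, 0.10), Size("large", 8, 32768, 0.40)]

choice = pick_size(SIZES, cores_needed=2, ram_mb_needed=4000, running_price=0.0)
print(choice.name)  # large: 4096 MB * 0.95 leaves less than 4000 MB on "small"
```

The example shows why the scaling factor matters: a job that asks for 4000 MB does not fit on a node advertised with 4096 MB once the 0.95 factor is applied.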
$ arvados-node-manager --config /etc/arvados-node-manager/config.ini