doc/install/install-nodemanager.html.textile.liquid

   1 ---
   2 layout: default
   3 navsection: installguide
   4 title: Install Node Manager
   5 ...
   6
   7 Arvados Node Manager provides elastic computing for Arvados and SLURM by creating and destroying virtual machines on demand.  Node Manager currently supports Amazon Web Services (AWS), Google Cloud Platform (GCP) and Microsoft Azure.
   8
   9 Note: node manager is only required for elastic computing cloud environments.  Fixed-size clusters do not require node manager.
  10
  11 h2. Install
  12
  13 Node manager may run anywhere, but it must be able to communicate with the cloud provider's APIs and to use the command line tools @sinfo@, @squeue@ and @scontrol@ to communicate with the cluster's SLURM controller.
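
Before installing, it can help to confirm that the SLURM tools are reachable from the chosen host. The following is a minimal sketch using only the Python standard library; the helper name @missing_tools@ is hypothetical and is not part of Node Manager.

```python
import shutil

# The three SLURM command line tools that Node Manager needs to reach.
REQUIRED_TOOLS = ("sinfo", "squeue", "scontrol")


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools that cannot be found on the current PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing SLURM tools: " + ", ".join(missing))
    else:
        print("All SLURM tools found.")
```

This only checks that the executables exist on the PATH. It does not verify that they can actually contact the SLURM controller.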
  14
  15 On Debian-based systems:
  16
  17 <notextile>
  18 <pre><code>~$ <span class="userinput">sudo apt-get install arvados-node-manager</span>
  19 </code></pre>
  20 </notextile>
  21
  22 On Red Hat-based systems:
  23
  24 <notextile>
  25 <pre><code>~$ <span class="userinput">sudo yum install arvados-node-manager</span>
  26 </code></pre>
  27 </notextile>
  28
  29 h2. Create compute image
  30
  31 Configure a virtual machine following the "instructions to set up a compute node":{{site.baseurl}}/install/crunch2-slurm/install-compute-node.html and set it up to run a "ping script":{{site.baseurl}}/install/install-compute-ping.html at boot.
  32
  33 Create a virtual machine image using the commands provided by your cloud provider.  We recommend using a tool such as "Packer":https://www.packer.io/ to automate this process.
  34
  35 Configure node manager to use the image with the @image@ or @image_id@ parameter.
  36
  37 h2. Configure node manager
  38
  39 The configuration file is located at @/etc/arvados-node-manager/config.ini@.  Some configuration details are specific to the cloud provider you are using:
  40
  41 * "Amazon Web Services":#aws
  42 * "Google Cloud Platform":#gcp
  43 * "Microsoft Azure":#azure
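
The provider-specific examples below share an INI layout, with sections such as @[Daemon]@, @[Cloud]@ and one @[Size NAME]@ section per node size. As a rough illustration of that layout, the sketch below parses a shortened sample with Python's standard @configparser@ module. The @load_sizes@ helper and the inline sample are assumptions made for this example, and this is not necessarily how Node Manager itself reads the file.

```python
import configparser

# A shortened sample in the same layout as the full examples.
SAMPLE = """
[Daemon]
max_nodes = 8

[Cloud]
provider = ec2
shutdown_windows = 54, 5, 1

[Size m4.large]
cores = 2
price = 0.126

[Size m4.xlarge]
cores = 4
price = 0.252
"""


def load_sizes(parser):
    """Collect every [Size NAME] section into {NAME: {"cores", "price"}}."""
    sizes = {}
    for section in parser.sections():
        if section.startswith("Size "):
            name = section[len("Size "):]
            sizes[name] = {
                "cores": parser.getint(section, "cores"),
                "price": parser.getfloat(section, "price"),
            }
    return sizes


# interpolation=None keeps any literal "%" in values from being interpreted.
parser = configparser.ConfigParser(interpolation=None)
parser.read_string(SAMPLE)
provider = parser.get("Cloud", "provider")
sizes = load_sizes(parser)
print(provider, sorted(sizes))
```

To inspect a real file, replace @parser.read_string(SAMPLE)@ with @parser.read("/etc/arvados-node-manager/config.ini")@.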
  44
  45 h3(#aws). Amazon Web Services
  46
  47 <pre>
  48 # EC2 configuration for Arvados Node Manager.
  49 # All times are in seconds unless specified otherwise.
  50
  51 [Daemon]
  52 # The dispatcher can customize the start and stop procedure for
  53 # cloud nodes.  For example, the SLURM dispatcher drains nodes
  54 # through SLURM before shutting them down.
  55 dispatcher = slurm
  56
  57 # Node Manager will ensure that there are at least this many nodes running at
  58 # all times.  If node manager needs to start new idle nodes for the purpose of
  59 # satisfying min_nodes, it will use the cheapest node type.  However, depending
  60 # on usage patterns, it may also satisfy min_nodes by keeping alive some
  61 # more-expensive nodes.
  62 min_nodes = 0
  63
  64 # Node Manager will not start any compute nodes when at least this
  65 # many are running.
  66 max_nodes = 8
  67
  68 # Upper limit on rate of spending (in $/hr).  Node Manager will not boot
  69 # additional nodes if the total price of already running nodes meets or
  70 # exceeds this threshold.  The default, 0, means no limit.
  71 max_total_price = 0
  72
  73 # Poll EC2 nodes and Arvados for new information every N seconds.
  74 poll_time = 60
  75
  76 # Polls have exponential backoff when services fail to respond.
  77 # This is the longest time to wait between polls.
  78 max_poll_time = 300
  79
  80 # If Node Manager can't successfully poll a service for this long,
  81 # it will never start or stop compute nodes, on the assumption that its
  82 # information is too outdated.
  83 poll_stale_after = 600
  84
  85 # If Node Manager boots a cloud node, and it does not pair with an Arvados
  86 # node before this long, assume that there was a cloud bootstrap failure and
  87 # shut it down.  Note that normal shutdown windows apply (see the Cloud
  88 # section), so this should be shorter than the first shutdown window value.
  89 boot_fail_after = 1800
  90
  91 # "Node stale time" affects two related behaviors.
  92 # 1. If a compute node has been running for at least this long, but it
  93 # isn't paired with an Arvados node, do not shut it down, but leave it alone.
  94 # This prevents the node manager from shutting down a node that might
  95 # actually be doing work, but is having temporary trouble contacting the
  96 # API server.
  97 # 2. When the Node Manager starts a new compute node, it will try to reuse
  98 # an Arvados node that hasn't been updated for this long.
  99 node_stale_after = 14400
 100
 101 # Scaling factor to be applied to nodes' available RAM size. Usually there's a
 102 # variable discrepancy between the advertised RAM value on cloud nodes and the
 103 # actual amount available.
 104 # If not set, this value will be set to 0.95
 105 node_mem_scaling = 0.95
 106
 107 # File path for Certificate Authorities
 108 certs_file = /etc/ssl/certs/ca-certificates.crt
 109
 110 [Logging]
 111 # Log file path
 112 file = /var/log/arvados/node-manager.log
 113
 114 # Log level for most Node Manager messages.
 115 # Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
 116 # WARNING lets you know when polling a service fails.
 117 # INFO additionally lets you know when a compute node is started or stopped.
 118 level = INFO
 119
 120 # You can also set different log levels for specific libraries.
 121 # Pykka is the Node Manager's actor library.
 122 # Setting this to DEBUG will display tracebacks for uncaught
 123 # exceptions in the actors, but it's also very chatty.
 124 pykka = WARNING
 125
 126 # Setting apiclient to INFO will log the URL of every Arvados API request.
 127 apiclient = WARNING
 128
 129 [Arvados]
 130 host = zyxwv.arvadosapi.com
 131 token = ARVADOS_TOKEN
 132 timeout = 15
 133
 134 # Accept an untrusted SSL certificate from the API server?
 135 insecure = no
 136
 137 [Cloud]
 138 provider = ec2
 139
 140 # It's usually most cost-effective to shut down compute nodes during narrow
 141 # windows of time.  For example, EC2 bills each node by the hour, so the best
 142 # time to shut down a node is right before a new hour of uptime starts.
 143 # Shutdown windows define these periods of time.  These are windows in
 144 # full minutes, separated by commas.  Counting from the time the node is
 145 # booted, the node WILL NOT shut down for N1 minutes; then it MAY shut down
 146 # for N2 minutes; then it WILL NOT shut down for N3 minutes; and so on.
 147 # For example, "54, 5, 1" means the node may shut down from the 54th to the
 148 # 59th minute of each hour of uptime.
 149 # Specify at least two windows.  You can add as many as you need beyond that.
 150 shutdown_windows = 54, 5, 1
 151
 152 [Cloud Credentials]
 153 key = KEY
 154 secret = SECRET_KEY
 155 region = us-east-1
 156 timeout = 60
 157
 158 [Cloud List]
 159 # This section defines filters that find compute nodes.
 160 # Tags that you specify here will automatically be added to nodes you create.
 161 # Replace colons in Amazon filters with underscores
 162 # (e.g., write "tag:mytag" as "tag_mytag").
 163 instance-state-name = running
 164 tag_arvados-class = dynamic-compute
 165 tag_cluster = zyxwv
 166
 167 [Cloud Create]
 168 # New compute nodes will send pings to Arvados at this host.
 169 # You may specify a port, and use brackets to disambiguate IPv6 addresses.
 170 ping_host = hostname:port
 171
 172 # Give the name of an SSH key on AWS...
 173 ex_keyname = string
 174
 175 # ... or a file path for an SSH key that can log in to the compute node.
 176 # (One or the other, not both.)
 177 # ssh_key = path
 178
 179 # The EC2 IDs of the image and subnet compute nodes should use.
 180 image_id = idstring
 181 subnet_id = idstring
 182
 183 # Comma-separated EC2 IDs for the security group(s) assigned to each
 184 # compute node.
 185 security_groups = idstring1, idstring2
 186
 187
 188 # You can define any number of Size sections to list EC2 sizes you're
 189 # willing to use.  The Node Manager should boot the cheapest size(s) that
 190 # can run jobs in the queue.
 191 #
 192 # Each size section MUST define the number of cores available in this
 193 # size class (since libcloud does not provide any consistent API for exposing
 194 # this setting).
 195 # You may also want to define the amount of scratch space (expressed
 196 # in GB) for Crunch jobs.  You can also override Amazon's provided
 197 # data fields (such as price per hour) by setting them here.
 198
 199 [Size m4.large]
 200 cores = 2
 201 price = 0.126
 202 scratch = 100
 203
 204 [Size m4.xlarge]
 205 cores = 4
 206 price = 0.252
 207 scratch = 100
 208 </pre>
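
The @shutdown_windows@ comment above describes alternating "will not" and "may" periods counted from boot. The sketch below turns that description into a small check. It assumes the pattern starts over once every window has elapsed, which is what the "each hour of uptime" example suggests. The function names are hypothetical and are not taken from Node Manager.

```python
def parse_windows(value):
    """Turn a setting such as "54, 5, 1" into a list of whole minutes."""
    windows = [int(part) for part in value.split(",")]
    if len(windows) < 2 or min(windows) < 0 or sum(windows) == 0:
        raise ValueError("need at least two non-negative windows")
    return windows


def may_shut_down(uptime_minutes, windows):
    """Return True if uptime_minutes falls inside a MAY-shut-down window.

    Windows alternate closed (WILL NOT), open (MAY), closed, ... starting
    from boot.  Assumption: the pattern restarts after the last window.
    """
    position = uptime_minutes % sum(windows)
    is_open = False  # the first window is always a WILL NOT window
    for length in windows:
        if position < length:
            return is_open
        position -= length
        is_open = not is_open
    return False


if __name__ == "__main__":
    windows = parse_windows("54, 5, 1")
    for minute in (10, 54, 58, 59, 114):
        print(minute, may_shut_down(minute, windows))
```

With @54, 5, 1@ the pattern is 60 minutes long, so under the stated assumption the open period recurs once per hour of uptime.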
 209
 210 h3(#gcp). Google Cloud Platform
 211
 212 <pre>
 213 # Google Compute Engine configuration for Arvados Node Manager.
 214 # All times are in seconds unless specified otherwise.
 215
 216 [Daemon]
 217 # Node Manager will ensure that there are at least this many nodes running at
 218 # all times.  If node manager needs to start new idle nodes for the purpose of
 219 # satisfying min_nodes, it will use the cheapest node type.  However, depending
 220 # on usage patterns, it may also satisfy min_nodes by keeping alive some
 221 # more-expensive nodes.
 222 min_nodes = 0
 223
 224 # Node Manager will not start any compute nodes when at least this
 225 # many are running.
 226 max_nodes = 8
 227
 228 # Poll compute nodes and Arvados for new information every N seconds.
 229 poll_time = 60
 230
 231 # Upper limit on rate of spending (in $/hr).  Node Manager will not boot
 232 # additional nodes if the total price of already running nodes meets or
 233 # exceeds this threshold.  The default, 0, means no limit.
 234 max_total_price = 0
 235
 236 # Polls have exponential backoff when services fail to respond.
 237 # This is the longest time to wait between polls.
 238 max_poll_time = 300
 239
 240 # If Node Manager can't successfully poll a service for this long,
 241 # it will never start or stop compute nodes, on the assumption that its
 242 # information is too outdated.
 243 poll_stale_after = 600
 244
 245 # "Node stale time" affects two related behaviors.
 246 # 1. If a compute node has been running for at least this long, but it
 247 # isn't paired with an Arvados node, do not shut it down, but leave it alone.
 248 # This prevents the node manager from shutting down a node that might
 249 # actually be doing work, but is having temporary trouble contacting the
 250 # API server.
 251 # 2. When the Node Manager starts a new compute node, it will try to reuse
 252 # an Arvados node that hasn't been updated for this long.
 253 node_stale_after = 14400
 254
 255 # Scaling factor to be applied to nodes' available RAM size. Usually there's a
 256 # variable discrepancy between the advertised RAM value on cloud nodes and the
 257 # actual amount available.
 258 # If not set, this value will be set to 0.95
 259 node_mem_scaling = 0.95
 260
 261 # File path for Certificate Authorities
 262 certs_file = /etc/ssl/certs/ca-certificates.crt
 263
 264 [Logging]
 265 # Log file path
 266 file = /var/log/arvados/node-manager.log
 267
 268 # Log level for most Node Manager messages.
 269 # Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
 270 # WARNING lets you know when polling a service fails.
 271 # INFO additionally lets you know when a compute node is started or stopped.
 272 level = INFO
 273
 274 # You can also set different log levels for specific libraries.
 275 # Pykka is the Node Manager's actor library.
 276 # Setting this to DEBUG will display tracebacks for uncaught
 277 # exceptions in the actors, but it's also very chatty.
 278 pykka = WARNING
 279
 280 # Setting apiclient to INFO will log the URL of every Arvados API request.
 281 apiclient = WARNING
 282
 283 [Arvados]
 284 host = zyxwv.arvadosapi.com
 285 token = ARVADOS_TOKEN
 286 timeout = 15
 287
 288 # Accept an untrusted SSL certificate from the API server?
 289 insecure = no
 290
 291 [Cloud]
 292 provider = gce
 293
 294 # Shutdown windows define periods of time when a node may and may not
 295 # be shut down.  These are windows in full minutes, separated by
 296 # commas.  Counting from the time the node is booted, the node WILL
 297 # NOT shut down for N1 minutes; then it MAY shut down for N2 minutes;
 298 # then it WILL NOT shut down for N3 minutes; and so on.  For example,
 299 # "54, 5, 1" means the node may shut down from the 54th to the 59th
 300 # minute of each hour of uptime.
 301 # GCE bills by the minute, and does not provide information about when
 302 # a node booted.  Node Manager will store this information in metadata
 303 # when it boots a node; if that information is not available, it will
 304 # assume the node booted at the epoch.  These shutdown settings are
 305 # very aggressive.  You may want to adjust this if you want more
 306 # continuity of service from a single node.
 307 shutdown_windows = 20, 999999
 308
 309 [Cloud Credentials]
 310 user_id = client_email_address@developer.gserviceaccount.com
 311 key = path_to_certificate.pem
 312 project = project-id-from-google-cloud-dashboard
 313 timeout = 60
 314
 315 # Valid location (zone) names: https://cloud.google.com/compute/docs/zones
 316 datacenter = us-central1-a
 317
 318 # Optional settings. For full documentation see
 319 # http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#libcloud.compute.drivers.gce.GCENodeDriver
 320 #
 321 # auth_type = SA               # SA, IA or GCE
 322 # scopes = https://www.googleapis.com/auth/compute
 323 # credential_file =
 324
 325 [Cloud List]
 326 # A comma-separated list of tags that must be applied to a node for it to
 327 # be considered a compute node.
 328 # The driver will automatically apply these tags to nodes it creates.
 329 tags = zyxwv, compute
 330
 331 [Cloud Create]
 332 # New compute nodes will send pings to Arvados at this host.
 333 # You may specify a port, and use brackets to disambiguate IPv6 addresses.
 334 ping_host = hostname:port
 335
 336 # A file path for an SSH key that can log in to the compute node.
 337 # ssh_key = path
 338
 339 # The GCE image name and network zone name to use when creating new nodes.
 340 image = debian-7
 341 # network = your_network_name
 342
 343 # JSON string of service account authorizations for this cluster.
 344 # See http://libcloud.readthedocs.org/en/latest/compute/drivers/gce.html#specifying-service-account-scopes
 345 # service_accounts = [{'email':'account@example.com', 'scopes':['storage-ro']}]
 346
 347
 348 # You can define any number of Size sections to list node sizes you're
 349 # willing to use.  The Node Manager should boot the cheapest size(s) that
 350 # can run jobs in the queue.
 351 #
 352 # The Size fields are interpreted the same way as with a libcloud NodeSize:
 353 # http://libcloud.readthedocs.org/en/latest/compute/api.html#libcloud.compute.base.NodeSize
 354 #
 355 # See https://cloud.google.com/compute/docs/machine-types for a list
 356 # of known machine types that may be used as a Size parameter.
 357 #
 358 # Each size section MUST define the number of cores available in this
 359 # size class (since libcloud does not provide any consistent API for exposing
 360 # this setting).
 361 # You may also want to define the amount of scratch space (expressed
 362 # in GB) for Crunch jobs.
 363 # You can also override Google's provided data fields (such as price per hour)
 364 # by setting them here.
 365
 366 [Size n1-standard-2]
 367 cores = 2
 368 price = 0.076
 369 scratch = 100
 370
 371 [Size n1-standard-4]
 372 cores = 4
 373 price = 0.152
 374 scratch = 200
 375 </pre>
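
The @poll_time@, @max_poll_time@ and @poll_stale_after@ settings in the @[Daemon]@ section interact: polls back off exponentially up to a cap, and data older than the stale limit blocks scaling decisions. The sketch below illustrates that interaction. The doubling factor and both function names are assumptions for this example, because the configuration comments do not state the exact backoff formula.

```python
def next_poll_delay(poll_time, max_poll_time, consecutive_failures):
    """Seconds to wait before the next poll.

    Assumption: the delay doubles after each consecutive failure and is
    capped at max_poll_time.
    """
    delay = poll_time * (2 ** consecutive_failures)
    return min(delay, max_poll_time)


def is_stale(seconds_since_last_success, poll_stale_after):
    """True when the last successful poll is too old to act on."""
    return seconds_since_last_success > poll_stale_after


if __name__ == "__main__":
    # Values from the sample configuration: poll_time = 60, max_poll_time = 300.
    for failures in range(5):
        print(failures, next_poll_delay(60, 300, failures))
```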
 376
 377 h3(#azure). Microsoft Azure
 378
 379 <pre>
 380 # Azure configuration for Arvados Node Manager.
 381 # All times are in seconds unless specified otherwise.
 382
 383 [Daemon]
 384 # The dispatcher can customize the start and stop procedure for
 385 # cloud nodes.  For example, the SLURM dispatcher drains nodes
 386 # through SLURM before shutting them down.
 387 dispatcher = slurm
 388
 389 # Node Manager will ensure that there are at least this many nodes running at
 390 # all times.  If node manager needs to start new idle nodes for the purpose of
 391 # satisfying min_nodes, it will use the cheapest node type.  However, depending
 392 # on usage patterns, it may also satisfy min_nodes by keeping alive some
 393 # more-expensive nodes.
 394 min_nodes = 0
 395
 396 # Node Manager will not start any compute nodes when at least this
 397 # many are running.
 398 max_nodes = 8
 399
 400 # Upper limit on rate of spending (in $/hr).  Node Manager will not boot
 401 # additional nodes if the total price of already running nodes meets or
 402 # exceeds this threshold.  The default, 0, means no limit.
 403 max_total_price = 0
 404
 405 # Poll Azure nodes and Arvados for new information every N seconds.
 406 poll_time = 60
 407
 408 # Polls have exponential backoff when services fail to respond.
 409 # This is the longest time to wait between polls.
 410 max_poll_time = 300
 411
 412 # If Node Manager can't successfully poll a service for this long,
 413 # it will never start or stop compute nodes, on the assumption that its
 414 # information is too outdated.
 415 poll_stale_after = 600
 416
 417 # If Node Manager boots a cloud node, and it does not pair with an Arvados
 418 # node before this long, assume that there was a cloud bootstrap failure and
 419 # shut it down.  Note that normal shutdown windows apply (see the Cloud
 420 # section), so this should be shorter than the first shutdown window value.
 421 boot_fail_after = 1800
 422
 423 # "Node stale time" affects two related behaviors.
 424 # 1. If a compute node has been running for at least this long, but it
 425 # isn't paired with an Arvados node, do not shut it down, but leave it alone.
 426 # This prevents the node manager from shutting down a node that might
 427 # actually be doing work, but is having temporary trouble contacting the
 428 # API server.
 429 # 2. When the Node Manager starts a new compute node, it will try to reuse
 430 # an Arvados node that hasn't been updated for this long.
 431 node_stale_after = 14400
 432
 433 # Scaling factor to be applied to nodes' available RAM size. Usually there's a
 434 # variable discrepancy between the advertised RAM value on cloud nodes and the
 435 # actual amount available.
 436 # If not set, this value will be set to 0.95
 437 node_mem_scaling = 0.95
 438
 439 # File path for Certificate Authorities
 440 certs_file = /etc/ssl/certs/ca-certificates.crt
 441
 442 [Logging]
 443 # Log file path
 444 file = /var/log/arvados/node-manager.log
 445
 446 # Log level for most Node Manager messages.
 447 # Choose one of DEBUG, INFO, WARNING, ERROR, or CRITICAL.
 448 # WARNING lets you know when polling a service fails.
 449 # INFO additionally lets you know when a compute node is started or stopped.
 450 level = INFO
 451
 452 # You can also set different log levels for specific libraries.
 453 # Pykka is the Node Manager's actor library.
 454 # Setting this to DEBUG will display tracebacks for uncaught
 455 # exceptions in the actors, but it's also very chatty.
 456 pykka = WARNING
 457
 458 # Setting apiclient to INFO will log the URL of every Arvados API request.
 459 apiclient = WARNING
 460
 461 [Arvados]
 462 host = zyxwv.arvadosapi.com
 463 token = ARVADOS_TOKEN
 464 timeout = 15
 465
 466 # Accept an untrusted SSL certificate from the API server?
 467 insecure = no
 468
 469 [Cloud]
 470 provider = azure
 471
 472 # Shutdown windows define periods of time when a node may and may not be shut
 473 # down.  These are windows in full minutes, separated by commas.  Counting from
 474 # the time the node is booted, the node WILL NOT shut down for N1 minutes; then
 475 # it MAY shut down for N2 minutes; then it WILL NOT shut down for N3 minutes;
 476 # and so on.  For example, "20, 999999" means the node may shut down between
 477 # the 20th and 999999th minutes of uptime.
 478 # Azure bills by the minute, so it makes sense to aggressively shut down idle
 479 # nodes.  Specify at least two windows.  You can add as many as you need beyond
 480 # that.
 481 shutdown_windows = 20, 999999
 482
 483 [Cloud Credentials]
 484 # Use "azure account list" with the azure CLI to get these values.
 485 tenant_id = 00000000-0000-0000-0000-000000000000
 486 subscription_id = 00000000-0000-0000-0000-000000000000
 487
 488 # The following directions are based on
 489 # https://azure.microsoft.com/en-us/documentation/articles/resource-group-authenticate-service-principal/
 490 #
 491 # azure config mode arm
 492 # azure ad app create --name "<Your Application Display Name>" --home-page "<https://YourApplicationHomePage>" --identifier-uris "<https://YouApplicationUri>" --password <Your_Password>
 493 # azure ad sp create "<Application_Id>"
 494 # azure role assignment create --objectId "<Object_Id>" -o Owner -c /subscriptions/{subscriptionId}/
 495 #
 496 # Use <Application_Id> for "key" and the <Your_Password> for "secret"
 497 #
 498 key = 00000000-0000-0000-0000-000000000000
 499 secret = PASSWORD
 500 timeout = 60
 501 region = East US
 502
 503 [Cloud List]
 504 # The resource group in which the compute node virtual machines will be created
 505 # and listed.
 506 ex_resource_group = ArvadosResourceGroup
 507
 508 [Cloud Create]
 509 # The compute node image, as a link to a VHD in Azure blob store.
 510 image = https://example.blob.core.windows.net/system/Microsoft.Compute/Images/images/zyxwv-compute-osDisk.vhd
 511
 512 # Path to a local ssh key file that will be used to provision new nodes.
 513 ssh_key = /home/arvadosuser/.ssh/id_rsa.pub
 514
 515 # The account name for the admin user that will be provisioned on new nodes.
 516 ex_user_name = arvadosuser
 517
 518 # The Azure storage account that will be used to store the node OS disk images.
 519 ex_storage_account = arvadosstorage
 520
 521 # The virtual network the VMs will be associated with.
 522 ex_network = ArvadosNetwork
 523
 524 # Optional subnet of the virtual network.
 525 #ex_subnet = default
 526
 527 # Node tags
 528 tag_arvados-class = dynamic-compute
 529 tag_cluster = zyxwv
 530
 531 # The API server to ping.
 532 ping_host = hostname:port
 533
 534 # You can define any number of Size sections to list Azure sizes you're willing
 535 # to use.  The Node Manager should boot the cheapest size(s) that can run jobs
 536 # in the queue.  You must also provide price per hour as the Azure compute
 537 # driver currently does not report prices.
 538 #
 539 # See https://azure.microsoft.com/en-us/pricing/details/virtual-machines/
 540 # for a list of known machine types that may be used as a Size parameter.
 541 #
 542 # Each size section MUST define the number of cores available in this
 543 # size class (since libcloud does not provide any consistent API for exposing
 544 # this setting).
 545 # You may also want to define the amount of scratch space (expressed
 546 # in GB) for Crunch jobs.  You can also override Microsoft's provided
 547 # data fields by setting them here.
 548
 549 [Size Standard_D3]
 550 cores = 4
 551 price = 0.56
 552
 553 [Size Standard_D4]
 554 cores = 8
 555 price = 1.12
 556 </pre>
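
The @[Size ...]@ sections, @max_nodes@ and @max_total_price@ together bound which node gets booted and whether one is booted at all. The sketch below shows one way to read those rules, using the two Azure sizes from the sample above. The @Size@ class, @cheapest_size@ and @may_boot@ are hypothetical names, and matching on core count alone is a simplifying assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Size:
    name: str
    cores: int
    price: float  # price per hour, in dollars


def cheapest_size(sizes, min_cores):
    """Return the lowest-priced size with enough cores, or None."""
    candidates = [size for size in sizes if size.cores >= min_cores]
    return min(candidates, key=lambda size: size.price) if candidates else None


def may_boot(running_prices, max_nodes, max_total_price):
    """Apply the max_nodes and max_total_price limits to the running nodes."""
    if len(running_prices) >= max_nodes:
        return False
    # A max_total_price of 0 means no spending limit.
    if max_total_price > 0 and sum(running_prices) >= max_total_price:
        return False
    return True


SIZES = [Size("Standard_D3", 4, 0.56), Size("Standard_D4", 8, 1.12)]

if __name__ == "__main__":
    print(cheapest_size(SIZES, 6))
    print(may_boot([0.56, 1.12], max_nodes=8, max_total_price=1.5))
```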
 557
 558 h2. Running
 559
 560 <pre>
 561 $ arvados-node-manager --config /etc/arvados-node-manager/config.ini
 562 </pre>