lib/dispatchcloud/readme.go

// Copyright (C) The Arvados Authors. All rights reserved.
//
// SPDX-License-Identifier: AGPL-3.0

package dispatchcloud

// A dispatcher comprises a container queue, a scheduler, a worker
// pool, a cloud provider, a stale-lock fixer, and a syncer. To run a
// dispatcher:
//
// 1. Choose a provider.
// 2. Start a worker pool.
// 3. Start a container queue.
// 4. Run a stale-lock fixer.
// 5. Start a scheduler.
// 6. Start a syncer.
//
//
// A provider (cloud driver) creates new cloud VM instances and gets
// the latest list of instances. The returned instances implement
// proxies to the provider's metadata and control interfaces (get IP
// address, update tags, shut down).
//
//
// A workerPool tracks workers' instance types and readiness states
// (available to do work now, booting, suffering a temporary network
// outage, shutting down). It loads internal state from the cloud
// provider's list of instances at startup, and syncs periodically
// after that.
//
//
// A worker maintains a multiplexed SSH connection to a cloud
// instance, retrying/reconnecting as needed, so the workerPool can
// execute commands. It asks the provider's instance to verify its SSH
// public key once when first connecting, and again later if the key
// changes.
//
//
// A container queue tracks the known state (according to
// arvados-controller) of each container of interest -- i.e., queued,
// or locked/running using our own dispatch token. It also proxies the
// dispatcher's lock/unlock/cancel requests to the controller. It
// handles concurrent refresh and update operations without exposing
// out-of-order updates to its callers. (It drops any new information
// that might have originated before its own most recent
// lock/unlock/cancel operation.)
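//
// The ordering rule above can be illustrated with a minimal sketch.
// Every name in it (queue, entry, beginRefresh, setLocal,
// applyRefresh, the "c1" identifier) is hypothetical and not part of
// this package. It also assumes a logical sequence counter in place
// of real timestamps, and a single goroutine, so it needs no locking.
//
/*
```go
package main

import "fmt"

// entry is a hypothetical queue record for one container.
type entry struct {
	state     string
	lastOpSeq int // sequence number of our latest lock/unlock/cancel
}

// queue is a hypothetical single-goroutine stand-in for a container
// queue. A concurrent version would need a mutex around all fields.
type queue struct {
	seq     int
	entries map[string]*entry
}

// beginRefresh returns a marker identifying when a refresh started.
func (q *queue) beginRefresh() int { return q.seq }

// setLocal records the outcome of one of our own operations
// (lock, unlock, or cancel) and advances the sequence counter.
func (q *queue) setLocal(uuid, state string) {
	q.seq++
	q.entries[uuid] = &entry{state: state, lastOpSeq: q.seq}
}

// applyRefresh stores state fetched by a refresh that began at marker
// startedAt. It drops the update if we performed an operation on that
// container after the refresh began, because the fetched state might
// predate that operation. It reports whether the update was applied.
func (q *queue) applyRefresh(startedAt int, uuid, state string) bool {
	e, ok := q.entries[uuid]
	if !ok {
		e = &entry{}
		q.entries[uuid] = e
	}
	if e.lastOpSeq > startedAt {
		return false
	}
	e.state = state
	return true
}

// demo returns the state seen after a stale refresh is dropped, and
// the state seen after a later refresh is applied.
func demo() (afterStale, afterFresh string) {
	q := &queue{entries: map[string]*entry{}}
	q.applyRefresh(q.beginRefresh(), "c1", "Queued")
	started := q.beginRefresh()             // a slow refresh begins
	q.setLocal("c1", "Locked")              // meanwhile, we lock c1
	q.applyRefresh(started, "c1", "Queued") // possibly stale: dropped
	afterStale = q.entries["c1"].state
	q.applyRefresh(q.beginRefresh(), "c1", "Running") // newer: applied
	return afterStale, q.entries["c1"].state
}

func main() {
	fmt.Println(demo())
}
```
*/
//
// The sketch compares sequence numbers rather than wall-clock times
// only to keep the example deterministic.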
//
//
// A stale-lock fixer waits for any already-locked containers (i.e.,
// locked by a prior server process) to appear on workers as the
// worker pool recovers its state. It unlocks/requeues any that still
// remain when all workers are recovered or shut down, or its timer
// expires.
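//
// The decision described above might be sketched as a pure function.
// The function name, its parameters, and the container identifiers
// are all hypothetical; the sketch assumes the caller already knows
// which containers were locked at startup and which have since been
// seen on a worker.
//
/*
```go
package main

import "fmt"

// staleLocks decides which containers to unlock. It returns done=false
// while the fixer should keep waiting, that is, until every worker has
// been recovered or shut down (allSettled) or the timer has expired
// (timedOut). After that, it returns every locked container that was
// never seen on a worker.
func staleLocks(locked []string, seenOnWorker map[string]bool, allSettled, timedOut bool) (unlock []string, done bool) {
	if !allSettled && !timedOut {
		return nil, false
	}
	for _, uuid := range locked {
		if !seenOnWorker[uuid] {
			unlock = append(unlock, uuid)
		}
	}
	return unlock, true
}

func main() {
	locked := []string{"c1", "c2"}
	seen := map[string]bool{"c1": true}
	fmt.Println(staleLocks(locked, seen, false, false)) // still waiting
	fmt.Println(staleLocks(locked, seen, true, false))  // workers settled
}
```
*/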
//
//
// A scheduler chooses which containers to assign to which idle
// workers, and decides what to do when there are not enough idle
// workers (including shutting down some idle nodes).
//
//
// A syncer updates a container's state to Cancelled when a running
// container process dies without finalizing its entry in the
// controller database. It also calls the worker pool to kill
// containers that have priority=0 while locked or running.
//
//
// A provider proxy wraps a provider with rate-limiting logic. After
// the wrapped provider receives a cloud.RateLimitError, the proxy
// starts returning errors to callers immediately without calling
// through to the wrapped provider.
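//
// A minimal sketch of such a proxy follows. All types in it
// (rateLimitError, provider, throttledProvider, stubProvider) are
// hypothetical stand-ins, not this package's API. It also assumes
// that the rate-limit error carries a time before which retrying is
// pointless, and that the proxy resumes calling through once that
// time has passed; the text above does not specify either detail.
//
/*
```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// rateLimitError is a hypothetical stand-in for cloud.RateLimitError.
// Assumption: the error says when it is reasonable to retry.
type rateLimitError struct{ retryAt time.Time }

func (e rateLimitError) Error() string { return "rate limit exceeded" }

// provider is a hypothetical one-method provider interface.
type provider interface{ Create() error }

// throttledProvider wraps a provider. After the wrapped provider
// returns a rateLimitError, calls fail immediately, without reaching
// the wrapped provider, until the retry time has passed.
type throttledProvider struct {
	wrapped provider
	now     func() time.Time // injectable clock, for the demo

	mtx       sync.Mutex
	holdUntil time.Time
}

func (p *throttledProvider) Create() error {
	p.mtx.Lock()
	hold := p.holdUntil
	p.mtx.Unlock()
	if p.now().Before(hold) {
		return errors.New("throttled: not calling provider yet")
	}
	err := p.wrapped.Create()
	var rle rateLimitError
	if errors.As(err, &rle) {
		p.mtx.Lock()
		if rle.retryAt.After(p.holdUntil) {
			p.holdUntil = rle.retryAt
		}
		p.mtx.Unlock()
	}
	return err
}

// stubProvider counts calls and fails the first one with a
// rateLimitError.
type stubProvider struct {
	calls   int
	retryAt time.Time
}

func (s *stubProvider) Create() error {
	s.calls++
	if s.calls == 1 {
		return rateLimitError{retryAt: s.retryAt}
	}
	return nil
}

// demo makes three calls: one that hits the rate limit, one during
// the hold period, and one after it. It returns how many calls
// reached the stub, and the error from each call.
func demo() (reached int, errs []error) {
	clock := time.Unix(0, 0)
	stub := &stubProvider{retryAt: clock.Add(time.Minute)}
	p := &throttledProvider{wrapped: stub, now: func() time.Time { return clock }}
	errs = append(errs, p.Create()) // passes through, rate limited
	errs = append(errs, p.Create()) // rejected by the proxy
	clock = clock.Add(2 * time.Minute)
	errs = append(errs, p.Create()) // passes through, succeeds
	return stub.calls, errs
}

func main() {
	reached, errs := demo()
	fmt.Println(reached, errs)
}
```
*/
//
// The injectable clock exists only so the demo can advance time
// without sleeping.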
//
//
// TBD: Bootstrapping script via SSH, too? Future version.
//
// TBD: drain instance, keep instance alive
// TBD: metrics, diagnostics
// TBD: why is the dispatch token currently passed to the worker?
//
// Metrics: queue size, time each job has been queued, number of
// idle/busy/booting nodes
// Timing in each step, and end-to-end
// Metrics: boot/idle/alloc time and cost