// Copyright (C) The Arvados Authors. All rights reserved.
//
// SPDX-License-Identifier: AGPL-3.0
//
// A dispatcher comprises a container queue, a scheduler, a worker
// pool, a cloud provider, a stale-lock fixer, and a syncer.
// 1. Choose a provider.
// 2. Start a worker pool.
// 3. Start a container queue.
// 4. Run a stale-lock fixer.
// 5. Start a scheduler.
//
// A provider (cloud driver) creates new cloud VM instances and gets
// the latest list of instances. The returned instances implement
// proxies to the provider's metadata and control interfaces (get IP
// address, update tags, shutdown).
//
// A workerPool tracks workers' instance types and readiness states
// (available to do work now, booting, suffering a temporary network
// outage, shutting down). It loads internal state from the cloud
// provider's list of instances at startup, and syncs periodically
//
// A worker maintains a multiplexed SSH connection to a cloud
// instance, retrying/reconnecting as needed, so the workerPool can
// execute commands. It asks the provider's instance to verify its SSH
// public key once when first connecting, and again later if the key
//
// A container queue tracks the known state (according to
// arvados-controller) of each container of interest -- i.e., queued,
// or locked/running using our own dispatch token. It also proxies the
// dispatcher's lock/unlock/cancel requests to the controller. It
// handles concurrent refresh and update operations without exposing
// out-of-order updates to its callers. (It drops any new information
// that might have originated before its own most recent
// lock/unlock/cancel operation.)
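//
// The out-of-order rule above can be sketched as follows. This is a
// minimal illustration, not the dispatcher's actual implementation:
// the queue type, its logical clock, and the beginRefresh, endRefresh,
// and update names are all hypothetical, and container states are
// plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// queue is a hypothetical, simplified container queue.
type queue struct {
	mtx        sync.Mutex
	clock      int               // logical clock, advanced on every event
	state      map[string]string // last known state, keyed by container UUID
	lastUpdate map[string]int    // clock value of our latest lock/unlock/cancel
}

func newQueue() *queue {
	return &queue{state: map[string]string{}, lastUpdate: map[string]int{}}
}

// beginRefresh marks the start of a refresh and returns a token for endRefresh.
func (q *queue) beginRefresh() int {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.clock++
	return q.clock
}

// update records the result of our own lock/unlock/cancel request.
func (q *queue) update(uuid, state string) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	q.clock++
	q.state[uuid] = state
	q.lastUpdate[uuid] = q.clock
}

// endRefresh applies fetched states, skipping any container we updated
// after the refresh started: that fetched state may already be stale.
func (q *queue) endRefresh(started int, fetched map[string]string) {
	q.mtx.Lock()
	defer q.mtx.Unlock()
	for uuid, state := range fetched {
		if q.lastUpdate[uuid] > started {
			continue
		}
		q.state[uuid] = state
	}
}

func main() {
	q := newQueue()
	started := q.beginRefresh()
	q.update("c1", "Locked") // our lock succeeds while the refresh is in flight
	q.endRefresh(started, map[string]string{"c1": "Queued", "c2": "Queued"})
	fmt.Println(q.state["c1"], q.state["c2"]) // prints "Locked Queued"
}
```

// In this sketch a refresh that began after the local update is
// applied normally, so the queue still converges on the controller's
// view.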
//
// A stale-lock fixer waits for any already-locked containers (i.e.,
// locked by a prior server process) to appear on workers as the
// worker pool recovers its state. It unlocks/requeues any that still
// remain when all workers are recovered or shut down, or its timer
//
// A scheduler chooses which containers to assign to which idle
// workers, and decides what to do when there are not enough idle
// workers (including shutting down some idle nodes).
//
// A syncer updates state to Cancelled when a running container
// process dies without finalizing its entry in the controller
// database. It also calls the worker pool to kill containers that
// have priority=0 while locked or running.
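//
// The two syncer rules above could be expressed as a single decision
// function. This is only a sketch under stated assumptions: the
// container struct, the action type, and syncAction are hypothetical
// names, states are plain strings, and processAlive stands in for
// whatever the worker pool reports about the container process.

```go
package main

import "fmt"

// container is a hypothetical, pared-down view of a container record.
type container struct {
	UUID     string
	State    string // e.g. "Queued", "Locked", "Running"
	Priority int
}

// action is what the syncer should do about one container.
type action string

const (
	actNone   action = "none"
	actCancel action = "cancel" // update state to Cancelled in the controller
	actKill   action = "kill"   // ask the worker pool to kill the process
)

// syncAction applies the two rules in order: a Running container whose
// process has died is cancelled; otherwise a Locked or Running
// container with priority 0 is killed.
func syncAction(c container, processAlive bool) action {
	switch {
	case c.State == "Running" && !processAlive:
		return actCancel
	case (c.State == "Locked" || c.State == "Running") && c.Priority == 0:
		return actKill
	}
	return actNone
}

func main() {
	fmt.Println(syncAction(container{UUID: "c1", State: "Running", Priority: 1}, false)) // prints "cancel"
	fmt.Println(syncAction(container{UUID: "c2", State: "Locked", Priority: 0}, true))   // prints "kill"
	fmt.Println(syncAction(container{UUID: "c3", State: "Running", Priority: 1}, true))  // prints "none"
}
```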
//
// A provider proxy wraps a provider with rate-limiting logic. After
// the wrapped provider receives a cloud.RateLimitError, the proxy
// starts returning errors to callers immediately without calling
// through to the wrapped provider.
//
// TBD: Bootstrapping script via SSH, too? Future version.
//
// TBD: drain instance, keep instance alive
// TBD: metrics, diagnostics
// TBD: why dispatch token currently passed to worker?
//
// Metrics: queue size, time each job has been queued, #idle/busy/booting nodes
// Timing in each step, and end-to-end
// Metrics: boot/idle/alloc time and cost