// Copyright (C) The Arvados Authors. All rights reserved.
//
// SPDX-License-Identifier: AGPL-3.0

package dispatchcloud

// A dispatcher comprises a container queue, a scheduler, a worker
// pool, a remote command executor, and a cloud driver.
// 1. Choose a provider.
// 2. Start a worker pool.
// 3. Start a container queue.
// 4. Run the scheduler's stale-lock fixer.
// 5. Run the scheduler's mapper.
// 6. Run the scheduler's syncer.
// 7. Wait for updates to the container queue or worker pool.
// 8. Repeat from 5.
//
//
// A cloud driver creates new cloud VM instances and gets the latest
// list of instances. The returned instances are caches/proxies for
// the provider's metadata and control interfaces (get IP address,
// update tags, shutdown).
//
//
// A worker pool tracks workers' instance types and readiness states
// (available to do work now, booting, suffering a temporary network
// outage, shutting down). It loads internal state from the cloud
// provider's list of instances at startup, and syncs periodically
// after that.
//
//
// An executor maintains a multiplexed SSH connection to a cloud
// instance, retrying/reconnecting as needed, so the worker pool can
// execute commands. It asks the cloud driver's instance to verify its
// SSH public key once when first connecting, and again later if the
// key changes.
//
//
// A container queue tracks the known state (according to
// arvados-controller) of each container of interest -- i.e., queued,
// or locked/running using our own dispatch token. It also proxies the
// dispatcher's lock/unlock/cancel requests to the controller. It
// handles concurrent refresh and update operations without exposing
// out-of-order updates to its callers. (It drops any new information
// that might have originated before its own most recent
// lock/unlock/cancel operation.)
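//
//
// The container queue's rule for dropping out-of-order information
// can be illustrated with a minimal sketch. This is not the
// package's implementation: the queue, entry, beginRefresh, setLocal
// and applyRefresh names are hypothetical, and a logical counter
// stands in for whatever ordering mechanism the real queue uses.

```go
package main

import "fmt"

// entry is a hypothetical record of one container's last known state.
type entry struct {
	state   string
	touched int // logical time of our own most recent lock/unlock/cancel
}

// queue is a hypothetical, single-goroutine stand-in for a container queue.
type queue struct {
	clock   int
	entries map[string]entry
}

func newQueue() *queue { return &queue{entries: map[string]entry{}} }

// beginRefresh returns the logical time at which a refresh started.
func (q *queue) beginRefresh() int {
	q.clock++
	return q.clock
}

// setLocal records the result of our own lock/unlock/cancel operation.
func (q *queue) setLocal(uuid, state string) {
	q.clock++
	q.entries[uuid] = entry{state: state, touched: q.clock}
}

// applyRefresh merges fetched states, skipping any container we changed
// after the refresh started: the fetched data may predate that change.
func (q *queue) applyRefresh(started int, fetched map[string]string) {
	for uuid, state := range fetched {
		cur := q.entries[uuid]
		if cur.touched > started {
			continue
		}
		cur.state = state
		q.entries[uuid] = cur
	}
}

func main() {
	q := newQueue()
	started := q.beginRefresh()
	q.setLocal("c1", "Locked") // happens while the refresh is in flight
	q.applyRefresh(started, map[string]string{"c1": "Queued", "c2": "Queued"})
	fmt.Println(q.entries["c1"].state, q.entries["c2"].state) // Locked Queued

	later := q.beginRefresh()
	q.applyRefresh(later, map[string]string{"c1": "Running"})
	fmt.Println(q.entries["c1"].state) // Running
}
```

// In this sketch the stale "Queued" report for c1 is discarded
// because the refresh began before our own lock operation, while a
// refresh that starts afterwards is applied normally.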
//
//
// The scheduler's stale-lock fixer waits for any already-locked
// containers (i.e., locked by a prior dispatcher process) to appear
// on workers as the worker pool recovers its state. It
// unlocks/requeues any that still remain when all workers are
// recovered or shut down, or its timer expires.
//
//
// The scheduler's mapper chooses which containers to assign to which
// idle workers, and decides what to do when there are not enough idle
// workers (including shutting down some idle nodes).
//
//
// The scheduler's syncer updates state to Cancelled when a running
// container process dies without finalizing its entry in the
// controller database. It also calls the worker pool to kill
// containers that have priority=0 while locked or running.
//
//
// An instance set proxy wraps a driver's instance set with
// rate-limiting logic. After the wrapped instance set receives a
// cloud.RateLimitError, the proxy starts returning errors to callers
// immediately without calling through to the wrapped instance set.
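//
//
// The instance set proxy's behavior can be sketched as follows. This
// is an illustration under stated assumptions, not the package's
// code: rateLimitError is a hypothetical stand-in for
// cloud.RateLimitError (assumed here to report an earliest retry
// time), instanceSet is a hypothetical one-method interface, and the
// injectable clock exists only to make the example deterministic.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// rateLimitError is a hypothetical stand-in for cloud.RateLimitError.
type rateLimitError struct{ retryAt time.Time }

func (e rateLimitError) Error() string            { return "rate limited" }
func (e rateLimitError) EarliestRetry() time.Time { return e.retryAt }

// instanceSet is a hypothetical one-method subset of a driver's instance set.
type instanceSet interface {
	Instances() ([]string, error)
}

// throttledSet wraps an instanceSet. After a rate-limit error it returns
// that error immediately until the reported retry time has passed.
type throttledSet struct {
	wrapped instanceSet
	now     func() time.Time // injectable clock (assumption for testing)
	mtx     sync.Mutex
	holdoff time.Time
	lastErr error
}

func (t *throttledSet) Instances() ([]string, error) {
	t.mtx.Lock()
	defer t.mtx.Unlock()
	if t.now().Before(t.holdoff) {
		return nil, t.lastErr // do not call through to the driver
	}
	list, err := t.wrapped.Instances()
	var rle rateLimitError
	if errors.As(err, &rle) {
		t.holdoff, t.lastErr = rle.EarliestRetry(), err
	}
	return list, err
}

// fakeDriver is rate limited on its first call and counts all calls.
type fakeDriver struct {
	calls   int
	retryAt time.Time
}

func (d *fakeDriver) Instances() ([]string, error) {
	d.calls++
	if d.calls == 1 {
		return nil, rateLimitError{d.retryAt}
	}
	return []string{"inst-1"}, nil
}

// demo makes three proxy calls: one rate limited, one during the holdoff,
// and one after it. It returns the driver's call count and which calls failed.
func demo() (int, []bool) {
	clock := time.Unix(0, 0)
	d := &fakeDriver{retryAt: clock.Add(10 * time.Second)}
	p := &throttledSet{wrapped: d, now: func() time.Time { return clock }}
	_, err1 := p.Instances()
	_, err2 := p.Instances()
	clock = clock.Add(11 * time.Second)
	_, err3 := p.Instances()
	return d.calls, []bool{err1 != nil, err2 != nil, err3 != nil}
}

func main() {
	calls, failed := demo()
	fmt.Println(calls, failed) // 2 [true true false]
}
```

// Only two of the three calls reach the fake driver: the second is
// answered by the proxy itself, which is the point of the wrapper.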