SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}
-Arvados services support endpoints for monitoring the performance of a cluster.
+Some Arvados services publish Prometheus/OpenMetrics-compatible metrics at @/metrics@. Metrics can help you understand how components perform under load, find performance bottlenecks, and detect and diagnose problems.
+
+To access metrics endpoints, services must be configured with a "management token":management-token.html. When accessing a metrics endpoint, prefix the management token with @"Bearer "@ and supply it in the @Authorization@ request header.
+
+<pre>curl -sfH "Authorization: Bearer your_management_token_goes_here" "https://0.0.0.0:25107/metrics"
+</pre>
+
+The plain text export format includes "help" messages with a description of each reported metric.
+
+When configuring Prometheus, use a @bearer_token@ or @bearer_token_file@ option to authenticate requests.
+
+<pre>scrape_configs:
+ - job_name: keepstore
+ bearer_token: your_management_token_goes_here
+ static_configs:
+ - targets:
+ - "keep0.ClusterID.example.com:25107"
+</pre>
+
+table(table table-bordered table-condensed table-hover).
+|_. Component|_. Metrics endpoint|
+|arvados-api-server||
+|arvados-controller|✓|
+|arvados-dispatch-cloud|✓|
+|arvados-git-httpd||
+|arvados-node-manager||
+|arvados-ws|✓|
+|composer||
+|keepproxy||
+|keepstore|✓|
+|keep-balance|✓|
+|keep-web|✓|
+|workbench1||
+|workbench2||
+
+h2. Node manager
+
+The node manager does not export prometheus-style metrics, but its @/status.json@ endpoint provides a snapshot of internal status at the time of the most recent wishlist update.
+
+<pre>curl -sfH "Authorization: Bearer your_management_token_goes_here" "http://0.0.0.0:8989/status.json"
+</pre>
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|nodes_booting|int|Number of nodes in booting state|
+|nodes_unpaired|int|Number of nodes in unpaired state|
+|nodes_busy|int|Number of nodes in busy state|
+|nodes_idle|int|Number of nodes in idle state|
+|nodes_fail|int|Number of nodes in fail state|
+|nodes_down|int|Number of nodes in down state|
+|nodes_shutdown|int|Number of nodes in shutdown state|
+|nodes_wish|int|Number of nodes in the current wishlist|
+|node_quota|int|Current node count ceiling due to cloud quota limits|
+|config_max_nodes|int|Configured max node count|
+
+h3. Example
+
+<pre>
+{
+ "actor_exceptions": 0,
+ "idle_times": {
+ "compute1": 0,
+ "compute3": 0,
+ "compute2": 0,
+ "compute4": 0
+ },
+ "create_node_errors": 0,
+ "destroy_node_errors": 0,
+ "nodes_idle": 0,
+ "config_max_nodes": 8,
+ "list_nodes_errors": 0,
+ "node_quota": 8,
+ "Version": "1.1.4.20180719160944",
+ "nodes_wish": 0,
+ "nodes_unpaired": 0,
+ "nodes_busy": 4,
+ "boot_failures": 0
+}
+</pre>