---
layout: default
navsection: admin
title: Metrics
...

{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.

SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}

Some Arvados services publish Prometheus/OpenMetrics-compatible metrics at @/metrics@, and some provide additional runtime status at @/status.json@.  Metrics can help you understand how components perform under load, find performance bottlenecks, and detect and diagnose problems.

To access metrics endpoints, services must be configured with a "management token":management-token.html. When accessing a metrics endpoint, prefix the management token with @"Bearer "@ and supply it in the @Authorization@ request header.

<pre>curl -sfH "Authorization: Bearer your_management_token_goes_here" "https://0.0.0.0:25107/status.json"
</pre>
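
The same request can be made from a script. The following is a minimal Python sketch using only the standard library; the function names @build_metrics_request@ and @fetch_endpoint@ are hypothetical helpers introduced for illustration, not part of any Arvados SDK.

```python
import urllib.request


def build_metrics_request(url: str, management_token: str) -> urllib.request.Request:
    """Build a GET request that carries the management token as a Bearer credential."""
    request = urllib.request.Request(url)
    # Same header the curl example sends: "Authorization: Bearer <token>"
    request.add_header("Authorization", "Bearer " + management_token)
    return request


def fetch_endpoint(url: str, management_token: str, timeout: float = 10.0) -> bytes:
    """Fetch a /metrics or /status.json endpoint and return the raw response body."""
    request = build_metrics_request(url, management_token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()
```

Calling @fetch_endpoint@ with the URL of a service's @/status.json@ or @/metrics@ endpoint and your management token returns the response body as bytes. @urlopen@ raises @urllib.error.HTTPError@ on a non-2xx response, which is roughly what the @-f@ flag does for curl.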

h2. Keep-web

Keep-web exports metrics at @/metrics@ -- e.g., @https://collections.zzzzz.arvadosapi.com/metrics@.

table(table table-bordered table-condensed).
|_. Name|_. Type|_. Description|
|request_duration_seconds|summary|elapsed time between receiving a request and sending the last byte of the response body (segmented by HTTP request method and response status code)|
|time_to_status_seconds|summary|elapsed time between receiving a request and sending the HTTP response status code (segmented by HTTP request method and response status code)|

Metrics in the @arvados_keepweb_collectioncache@ namespace report keep-web's internal cache of Arvados collection metadata.

table(table table-bordered table-condensed).
|_. Name|_. Type|_. Description|
|arvados_keepweb_collectioncache_requests|counter|cache lookups|
|arvados_keepweb_collectioncache_api_calls|counter|outgoing API calls|
|arvados_keepweb_collectioncache_permission_hits|counter|collection-to-permission cache hits|
|arvados_keepweb_collectioncache_pdh_hits|counter|UUID-to-PDH cache hits|
|arvados_keepweb_collectioncache_hits|counter|PDH-to-manifest cache hits|
|arvados_keepweb_collectioncache_cached_manifests|gauge|number of collections in the cache|
|arvados_keepweb_collectioncache_cached_manifest_bytes|gauge|memory consumed by cached collection manifests|
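
One way to use the cache counters above is to derive a hit ratio from a scrape of @/metrics@. The sketch below is illustrative only: @parse_samples@ is a deliberately small parser for the Prometheus text format (it assumes sample lines carry no trailing timestamp), and it assumes the two counters are exposed without labels. Treating @hits@ divided by @requests@ as the manifest cache hit ratio is an interpretation of the descriptions in the table, not something this page states.

```python
def parse_samples(text: str) -> dict:
    """Parse Prometheus text exposition into {series: value}, skipping comments.

    Assumption: no sample line carries a trailing timestamp, so the value
    is always the last whitespace-separated field.
    """
    samples = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank line, or a "# HELP" / "# TYPE" comment
        series, _, value = line.rpartition(" ")
        if not series:
            continue  # malformed line with no value field
        samples[series] = float(value)
    return samples


def manifest_hit_ratio(samples: dict):
    """Return PDH-to-manifest cache hits per cache lookup, or None if no lookups yet."""
    lookups = samples.get("arvados_keepweb_collectioncache_requests", 0.0)
    hits = samples.get("arvados_keepweb_collectioncache_hits", 0.0)
    return hits / lookups if lookups else None
```

Because both metrics are counters, the ratio computed from a single scrape covers the whole lifetime of the process. To see recent behavior, compute it from the difference between two scrapes instead.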

h2. Keepstore

Keepstore exports metrics at @/status.json@ -- e.g., @http://keep0.zzzzz.arvadosapi.com:25107/status.json@.

h3. Root

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|Volumes|         array of "volumeStatusEnt":#volumeStatusEnt ||
|BufferPool|      "PoolStatus":#PoolStatus ||
|PullQueue|       "WorkQueueStatus":#WorkQueueStatus ||
|TrashQueue|      "WorkQueueStatus":#WorkQueueStatus ||
|RequestsCurrent| int ||
|RequestsMax|     int ||
|Version|         string ||

h3(#volumeStatusEnt). volumeStatusEnt

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|Label|         string||
|Status|        "VolumeStatus":#VolumeStatus ||
|VolumeStats|   "ioStats":#ioStats ||

h3(#VolumeStatus). VolumeStatus

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|MountPoint| string||
|DeviceNum|  uint64||
|BytesFree|  uint64||
|BytesUsed|  uint64||

h3(#ioStats). ioStats

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|Errors|     uint64||
|Ops|        uint64||
|CompareOps| uint64||
|GetOps|     uint64||
|PutOps|     uint64||
|TouchOps|   uint64||
|InBytes|    uint64||
|OutBytes|   uint64||

h3(#PoolStatus). PoolStatus

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|BytesAllocatedCumulative|       uint64||
|BuffersMax|    int||
|BuffersInUse|  int||

h3(#WorkQueueStatus). WorkQueueStatus

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|InProgress| int||
|Queued|     int||

h3. Example response

<pre>
{
  "Volumes": [
    {
      "Label": "[UnixVolume /var/lib/arvados/keep0]",
      "Status": {
        "MountPoint": "/var/lib/arvados/keep0",
        "DeviceNum": 65029,
        "BytesFree": 222532972544,
        "BytesUsed": 435456679936
      },
      "InternalStats": {
        "Errors": 0,
        "InBytes": 1111,
        "OutBytes": 0,
        "OpenOps": 1,
        "StatOps": 4,
        "FlockOps": 0,
        "UtimesOps": 0,
        "CreateOps": 0,
        "RenameOps": 0,
        "UnlinkOps": 0,
        "ReaddirOps": 0
      }
    }
  ],
  "BufferPool": {
    "BytesAllocatedCumulative": 67108864,
    "BuffersMax": 20,
    "BuffersInUse": 0
  },
  "PullQueue": {
    "InProgress": 0,
    "Queued": 0
  },
  "TrashQueue": {
    "InProgress": 0,
    "Queued": 0
  },
  "RequestsCurrent": 1,
  "RequestsMax": 40,
  "Version": "dev"
}
</pre>
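
A response like the one above can be reduced to a few utilization figures for alerting or dashboards. The following Python sketch is one possible approach; @summarize_keepstore_status@ is a hypothetical helper, and treating @BytesUsed + BytesFree@ as a volume's capacity is an assumption that ignores any space reserved by the filesystem.

```python
def _fraction(numerator, denominator):
    """Return numerator / denominator, or None when the denominator is zero or missing."""
    return numerator / denominator if denominator else None


def summarize_keepstore_status(status: dict) -> dict:
    """Reduce a decoded keepstore /status.json document to utilization fractions."""
    volumes = []
    for volume in status.get("Volumes", []):
        volume_status = volume.get("Status", {})
        used = volume_status.get("BytesUsed", 0)
        free = volume_status.get("BytesFree", 0)
        volumes.append({
            "label": volume.get("Label"),
            # Assumption: capacity is approximated as used + free bytes
            "used_fraction": _fraction(used, used + free),
        })
    pool = status.get("BufferPool", {})
    return {
        "volumes": volumes,
        "buffer_utilization": _fraction(pool.get("BuffersInUse", 0), pool.get("BuffersMax", 0)),
        "request_utilization": _fraction(status.get("RequestsCurrent", 0), status.get("RequestsMax", 0)),
        "pull_queued": status.get("PullQueue", {}).get("Queued", 0),
        "trash_queued": status.get("TrashQueue", {}).get("Queued", 0),
    }
```

Pass in the result of @json.loads@ on the response body. Missing attributes are treated as zero, so the function tolerates responses that omit a section.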

h2. Keep-balance

Keep-balance exports metrics at @/metrics@ -- e.g., @http://keep.zzzzz.arvadosapi.com:9005/metrics@.

table(table table-bordered table-condensed).
|_. Name|_. Type|_. Description|
|arvados_keep_total_{replicas,blocks,bytes}|gauge|stored data (stored in backend volumes, whether referenced or not)|
|arvados_keep_garbage_{replicas,blocks,bytes}|gauge|garbage data (unreferenced, and old enough to trash)|
|arvados_keep_transient_{replicas,blocks,bytes}|gauge|transient data (unreferenced, but too new to trash)|
|arvados_keep_overreplicated_{replicas,blocks,bytes}|gauge|overreplicated data (more replicas exist than are needed)|
|arvados_keep_underreplicated_{replicas,blocks,bytes}|gauge|underreplicated data (fewer replicas exist than are needed)|
|arvados_keep_lost_{replicas,blocks,bytes}|gauge|lost data (referenced by collections, but not found on any backend volume)|
|arvados_keep_dedup_block_ratio|gauge|deduplication ratio (block references in collections &divide; distinct blocks referenced)|
|arvados_keep_dedup_byte_ratio|gauge|deduplication ratio (block references in collections &divide; distinct blocks referenced, weighted by block size)|
|arvados_keepbalance_get_state_seconds|summary|time to get all collections and keepstore volume indexes for one iteration|
|arvados_keepbalance_changeset_compute_seconds|summary|time to compute changesets for one iteration|
|arvados_keepbalance_send_pull_list_seconds|summary|time to send pull lists to all keepstore servers for one iteration|
|arvados_keepbalance_send_trash_list_seconds|summary|time to send trash lists to all keepstore servers for one iteration|
|arvados_keepbalance_sweep_seconds|summary|time to complete one iteration|

Each @arvados_keep_@ storage state statistic above is presented as a set of three metrics:

table(table table-bordered table-condensed).
|*_blocks|distinct block hashes|
|*_bytes|bytes stored on backend volumes|
|*_replicas|objects/files stored on backend volumes|
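
The two deduplication ratios in the table are easier to read with a worked example. The Python sketch below computes both from a list of block references, following the definitions given above. It only illustrates the arithmetic: the function name and input shape are hypothetical, and this is not keep-balance's actual implementation.

```python
def dedup_ratios(block_references):
    """Compute (block_ratio, byte_ratio) from (block_hash, size_in_bytes) pairs.

    Each pair represents one reference to a block from a collection, so a
    block referenced by several collections appears several times.
    """
    references = list(block_references)
    distinct = {}  # block hash -> size, one entry per distinct block
    for block_hash, size in references:
        distinct[block_hash] = size
    if not distinct:
        return None, None
    # References divided by distinct blocks referenced
    block_ratio = len(references) / len(distinct)
    # The same ratio, with every block weighted by its size
    distinct_bytes = sum(distinct.values())
    referenced_bytes = sum(size for _, size in references)
    byte_ratio = referenced_bytes / distinct_bytes if distinct_bytes else None
    return block_ratio, byte_ratio
```

For example, two references to a 10-byte block plus one reference to a 30-byte block give three references over two distinct blocks (block ratio 1.5) and 50 referenced bytes over 40 distinct bytes (byte ratio 1.25).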

h2. Node manager

The node manager status endpoint provides a snapshot of internal status at the time of the most recent wishlist update.

table(table table-bordered table-condensed).
|_. Attribute|_. Type|_. Description|
|nodes_booting|int|Number of nodes in booting state|
|nodes_unpaired|int|Number of nodes in unpaired state|
|nodes_busy|int|Number of nodes in busy state|
|nodes_idle|int|Number of nodes in idle state|
|nodes_fail|int|Number of nodes in fail state|
|nodes_down|int|Number of nodes in down state|
|nodes_shutdown|int|Number of nodes in shutdown state|
|nodes_wish|int|Number of nodes in the current wishlist|
|node_quota|int|Current node count ceiling due to cloud quota limits|
|config_max_nodes|int|Configured max node count|

h3. Example

<pre>
{
  "actor_exceptions": 0,
  "idle_times": {
    "compute1": 0,
    "compute3": 0,
    "compute2": 0,
    "compute4": 0
  },
  "create_node_errors": 0,
  "destroy_node_errors": 0,
  "nodes_idle": 0,
  "config_max_nodes": 8,
  "list_nodes_errors": 0,
  "node_quota": 8,
  "Version": "1.1.4.20180719160944",
  "nodes_wish": 0,
  "nodes_unpaired": 0,
  "nodes_busy": 4,
  "boot_failures": 0
}
</pre>
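
The state counters can be combined into a quick capacity summary. The Python sketch below makes two assumptions that this page does not confirm: state counters missing from the response (the example above omits several) are treated as zero, and nodes in every listed state count toward the node ceiling. @summarize_nodes@ is a hypothetical helper name.

```python
# Per-state counters listed in the attribute table above
STATE_KEYS = (
    "nodes_booting",
    "nodes_unpaired",
    "nodes_busy",
    "nodes_idle",
    "nodes_fail",
    "nodes_down",
    "nodes_shutdown",
)


def summarize_nodes(status: dict) -> dict:
    """Summarize node counts and remaining headroom from a decoded status document."""
    # Assumption: a counter absent from the response means zero nodes in that state
    total = sum(status.get(key, 0) for key in STATE_KEYS)
    ceilings = [status[key] for key in ("node_quota", "config_max_nodes") if key in status]
    ceiling = min(ceilings) if ceilings else None
    headroom = None if ceiling is None else max(ceiling - total, 0)
    return {
        "total": total,
        "ceiling": ceiling,
        "headroom": headroom,
        "wishlist": status.get("nodes_wish", 0),
    }
```

Applied to the example response, this reports 4 nodes in total (all busy), a ceiling of 8, and headroom for 4 more nodes.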