SPDX-License-Identifier: CC-BY-SA-3.0
{% endcomment %}
-Health check endpoints are found at @/_health/ping@ on many Arvados services. The purpose of the health check is to be a simple method of determining if a service can be contacted and if it believes it is functioning properly, suitable for integrating into operational alert systems.
+Health check endpoints are found at @/_health/ping@ on many Arvados services. A health check offers a simple way to determine whether a service can be reached and lets the service self-report any problems, which makes it suitable for integration into operational alert systems.
-Health check endpoints must be configured with a "management token":management-token.html .
+To access health check endpoints, services must be configured with a "management token":management-token.html .
-This endpoint returns a JSON object with the field @health@. This has a value of either @OK@ or @ERROR@. On error, it may also include a field @error@ with additional information. Examples:
+Health check endpoints return a JSON object with the field @health@. This has a value of either @OK@ or @ERROR@. On error, it may also include a field @error@ with additional information. Examples:
<pre>
{
<pre>
{
"health": "ERROR",
- "error": "Inverted polarity of the warp core"
+ "error": "Inverted polarity in the warp core"
}
</pre>
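
The request and response described above can be sketched in Python. This is a minimal illustration, not part of Arvados: the helper names @build_ping_request@ and @interpret_health@ are hypothetical, and the only details taken from this page are the @/_health/ping@ path, the @Authorization: Bearer xxx@ header format, and the @health@ and @error@ fields.

```python
import json
import urllib.request


def build_ping_request(base_url, management_token):
    """Build (but do not send) a GET request for a service's health check.

    base_url and management_token are supplied by the caller; nothing here
    is read from the Arvados configuration.
    """
    url = base_url.rstrip("/") + "/_health/ping"
    request = urllib.request.Request(url)
    # The server expects "Authorization: Bearer <token>".
    request.add_header("Authorization", "Bearer " + management_token)
    return request


def interpret_health(body):
    """Return (is_healthy, error_message) for a health check response body.

    error_message is None when the response has no "error" field.
    """
    report = json.loads(body)
    is_healthy = report.get("health") == "OK"
    return is_healthy, report.get("error")
```

To use it against a live service, pass the request to @urllib.request.urlopen@ and hand the response body to @interpret_health@. An alerting integration would probably also want to treat connection failures and timeouts as unhealthy.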
The service @arvados-health@ performs health checks on all configured services and returns a single value of @OK@ or @ERROR@ for the entire cluster. It exposes the endpoint @/_health/all@ .
-The healthcheck aggregator uses the "NodeProfile" section of the cluster-wide configuration file. Here is an example.
+The healthcheck aggregator uses the @NodeProfile@ section of the cluster-wide @arvados.yml@ configuration file. Here is an example.
<pre>
Cluster:
h2. API server
-Set @MangementToken@ in @application.yml@
+Set @ManagementToken@ in the appropriate section of @application.yml@
<pre>
+production:
# Token to be included in all healthcheck requests. Disabled by default.
# Server expects request header of the format "Authorization: Bearer xxx"
- ManagementToken: ...
+ ManagementToken: xxx
</pre>
h2. Node Manager
<pre>
[Manage]
-port=8888
-ManagementToken=...
+# The management server responds to http://addr:port/status.json with
+# a snapshot of internal state.
+
+# Management server listening address (default 127.0.0.1)
+#address = 0.0.0.0
+
+# Management server port number (default -1, server is disabled)
+#port = 8989
+
+ManagementToken = xxx
</pre>
h2. Other services
-The following services also support health check. Set @MangementToken@ in the respective yaml config file for each service.
+The following services also support monitoring. Set @ManagementToken@ in the respective YAML config file for each service.
* keepstore
* keep-web
Metrics endpoints are found at @/status.json@ on many Arvados services. The purpose of metrics is to provide statistics about the operation of a service, suitable for diagnosing how well a service is performing under load.
-Metrics endpoints must be configured with a "management token":management-token.html .
+To access metrics endpoints, services must be configured with a "management token":management-token.html .
h2. Keepstore
|InProgress| int||
|Queued| int||
+h3. Example response
+
+<pre>
+{
+ "Volumes": [
+ {
+ "Label": "[UnixVolume /var/lib/arvados/keep0]",
+ "Status": {
+ "MountPoint": "/var/lib/arvados/keep0",
+ "DeviceNum": 65029,
+ "BytesFree": 222532972544,
+ "BytesUsed": 435456679936
+ },
+ "InternalStats": {
+ "Errors": 0,
+ "InBytes": 1111,
+ "OutBytes": 0,
+ "OpenOps": 1,
+ "StatOps": 4,
+ "FlockOps": 0,
+ "UtimesOps": 0,
+ "CreateOps": 0,
+ "RenameOps": 0,
+ "UnlinkOps": 0,
+ "ReaddirOps": 0
+ }
+ }
+ ],
+ "BufferPool": {
+ "BytesAllocatedCumulative": 67108864,
+ "BuffersMax": 20,
+ "BuffersInUse": 0
+ },
+ "PullQueue": {
+ "InProgress": 0,
+ "Queued": 0
+ },
+ "TrashQueue": {
+ "InProgress": 0,
+ "Queued": 0
+ },
+ "RequestsCurrent": 1,
+ "RequestsMax": 40,
+ "Version": "dev"
+}
+</pre>
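
A monitoring script might reduce a response like the one above to a few ratios. The following Python sketch is only an illustration: the function name @summarize_keepstore_status@ is hypothetical, and treating @BytesFree + BytesUsed@ as the volume's capacity is an assumption rather than something keepstore documents.

```python
import json


def _ratio(part, whole):
    """Return part / whole, or 0.0 when whole is zero."""
    return part / whole if whole else 0.0


def summarize_keepstore_status(body):
    """Condense a keepstore /status.json body into a few usage ratios."""
    status = json.loads(body)
    volumes = []
    for volume in status.get("Volumes", []):
        stats = volume["Status"]
        # Assumption: free + used bytes approximates the volume's capacity.
        capacity = stats["BytesFree"] + stats["BytesUsed"]
        volumes.append({
            "mount_point": stats["MountPoint"],
            "fraction_used": _ratio(stats["BytesUsed"], capacity),
            "errors": volume["InternalStats"]["Errors"],
        })
    pool = status["BufferPool"]
    return {
        "volumes": volumes,
        "buffers_in_use_fraction": _ratio(pool["BuffersInUse"], pool["BuffersMax"]),
        "requests_fraction": _ratio(status["RequestsCurrent"], status["RequestsMax"]),
    }
```

Ratios close to 1.0 for buffers or requests would suggest the service is near its configured limits. Any thresholds used for alerting are left to the operator.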
+
h2. Node manager
The node manager status endpoint provides a snapshot of internal status at the time of the most recent wishlist update.
|nodes_wish|int|Number of nodes in the current wishlist|
|node_quota|int|Current node count ceiling due to cloud quota limits|
|config_max_nodes|int|Configured max node count|
+
+h3. Example response
+
+<pre>
+{
+ "actor_exceptions": 0,
+ "idle_times": {
+ "compute1": 0,
+ "compute3": 0,
+ "compute2": 0,
+ "compute4": 0
+ },
+ "create_node_errors": 0,
+ "destroy_node_errors": 0,
+ "nodes_idle": 0,
+ "config_max_nodes": 8,
+ "list_nodes_errors": 0,
+ "node_quota": 8,
+ "Version": "1.1.4.20180719160944",
+ "nodes_wish": 0,
+ "nodes_unpaired": 0,
+ "nodes_busy": 4,
+ "boot_failures": 0
+}
+</pre>
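
One way to use a status snapshot like the one above is to flag any error counter that is not zero. The Python sketch below is a hedged illustration: the function name @nonzero_error_counters@ is hypothetical, and the choice of which fields count as error counters is an assumption based only on their names in the example response.

```python
import json

# Assumption: these fields from the example response are error counters.
ERROR_COUNTERS = (
    "actor_exceptions",
    "boot_failures",
    "create_node_errors",
    "destroy_node_errors",
    "list_nodes_errors",
)


def nonzero_error_counters(body):
    """Return a dict of the error counters in a status body that are above zero.

    Counters missing from the body are treated as zero.
    """
    status = json.loads(body)
    return {
        name: status[name]
        for name in ERROR_COUNTERS
        if status.get(name, 0) > 0
    }
```

An empty result means none of the listed counters reported a problem in that snapshot. A non-empty result could be forwarded to an alert system along with the @Version@ field for context.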
# a snapshot of internal state.
# Management server listening address (default 127.0.0.1)
-#address = 0.0.0.0
+address = 0.0.0.0
# Management server port number (default -1, server is disabled)
-#port = 8989
+port = 8989
+
+ManagementToken = xxx
[Daemon]
# The dispatcher can customize the start and stop procedure for