13791: More detail about monitoring

author Peter Amstutz <pamstutz@veritasgenetics.com>

Tue, 24 Jul 2018 19:02:52 +0000 (15:02 -0400)

committer Peter Amstutz <pamstutz@veritasgenetics.com>

Tue, 24 Jul 2018 19:02:52 +0000 (15:02 -0400)
diff --git a/doc/admin/health-checks.html.textile.liquid b/doc/admin/health-checks.html.textile.liquid

index 9370c6ce68a84e48238be7e609107a98e6bef2b6..630c6a178f1cbd39db459c7344ca081bc460604c 100644 (file)
--- a/doc/admin/health-checks.html.textile.liquid
+++ b/doc/admin/health-checks.html.textile.liquid
@@ -10,11 +10,11 @@ Copyright (C) The Arvados Authors. All rights reserved.
  SPDX-License-Identifier: CC-BY-SA-3.0
  {% endcomment %}
  
-Health check endpoints are found at @/_health/ping@ on many Arvados services.  The purpose of the health check is to be a simple method of determining if a service can be contacted and if it believes it is functioning properly, suitable for integrating into operational alert systems.
+Health check endpoints are found at @/_health/ping@ on many Arvados services.  The health check offers a simple way to determine whether a service can be reached and lets the service self-report any problems, which makes it suitable for integrating into operational alert systems.
  
-Health check endpoints must be configured with a "management token":management-token.html .
+To access health check endpoints, services must be configured with a "management token":management-token.html .
  
-This endpoint returns a JSON object with the field @health@.  This has a value of either @OK@ or @ERROR@.  On error, it may also include a  field @error@ with additional information.  Examples:
+Health check endpoints return a JSON object with the field @health@, whose value is either @OK@ or @ERROR@.  On error, the object may also include a field @error@ with additional information.  Examples:
  
  <pre>
  {
@@ -25,7 +25,7 @@ This endpoint returns a JSON object with the field @health@.  This has a value o
  <pre>
  {
    "health": "ERROR"
-  "error": "Inverted polarity of the warp core"
+  "error": "Inverted polarity in the warp core"
  }
  </pre>
  
@@ -33,7 +33,7 @@ h2. Healthcheck aggregator
  
  The service @arvados-health@ performs health checks on all configured services and returns a single value of @OK@ or @ERROR@ for the entire cluster.  It exposes the endpoint @/_health/all@ .
  
-The healthcheck aggregator uses the "NodeProfile" section of the cluster-wide configuration file.  Here is an example.
+The healthcheck aggregator uses the @NodeProfile@ section of the cluster-wide @arvados.yml@ configuration file.  Here is an example.
  
  <pre>
  Cluster:
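Note on the health check contract documented above: a minimal Python sketch of a client, assuming only what the text states (the @/_health/ping@ path, a bearer management token, and a JSON body with @health@ and an optional @error@). The function names, the base URL, and the timeout are hypothetical and are not part of Arvados.

```python
import json
import urllib.request
from typing import Tuple


def build_ping_request(base_url: str, token: str) -> urllib.request.Request:
    """Build a GET request for /_health/ping carrying the management token."""
    url = base_url.rstrip("/") + "/_health/ping"
    # The docs describe a request header of the form "Authorization: Bearer xxx".
    return urllib.request.Request(url, headers={"Authorization": "Bearer " + token})


def interpret_health(body: str) -> Tuple[bool, str]:
    """Return (healthy, detail) for a health check response body."""
    try:
        doc = json.loads(body)
    except ValueError as exc:
        return False, "unparseable response: %s" % exc
    if not isinstance(doc, dict):
        return False, "unexpected response type"
    if doc.get("health") == "OK":
        return True, ""
    # Anything other than "OK" is treated as a failure; "error" is optional.
    return False, str(doc.get("error", "no error detail provided"))


def check_service(base_url: str, token: str, timeout: float = 5.0) -> Tuple[bool, str]:
    """Ping one service; an unreachable service counts as unhealthy."""
    try:
        request = build_ping_request(base_url, token)
        with urllib.request.urlopen(request, timeout=timeout) as resp:
            return interpret_health(resp.read().decode("utf-8"))
    except OSError as exc:  # URLError and HTTPError are subclasses of OSError
        return False, "request failed: %s" % exc
```

An alerting script could call @check_service@ once per configured service and raise an alert whenever the first element of the result is false. Treating connection failures as unhealthy is a design choice of this sketch, not something the documentation prescribes.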
diff --git a/doc/admin/management-token.html.textile.liquid b/doc/admin/management-token.html.textile.liquid

index 33027ad88701b06723cbeed1089e664efec32c0c..306314337ab4d91f44f1d25c5d9f173d2f8b2417 100644 (file)
--- a/doc/admin/management-token.html.textile.liquid
+++ b/doc/admin/management-token.html.textile.liquid
@@ -18,12 +18,13 @@ To access a monitoring endpoint, the requester must provide the HTTP header @Aut
  
  h2. API server
  
-Set @MangementToken@ in @application.yml@
+Set @ManagementToken@ in the appropriate section of @application.yml@
  
  <pre>
+production:
    # Token to be included in all healthcheck requests. Disabled by default.
    # Server expects request header of the format "Authorization: Bearer xxx"
-  ManagementToken: ...
+  ManagementToken: xxx
  </pre>
  
  h2. Node Manager
@@ -32,13 +33,21 @@ Set @port@ (the listen port) and @MangementToken@ in the @Manage@ section of @no
  
  <pre>
  [Manage]
-port=8888
-ManagementToken=...
+# The management server responds to http://addr:port/status.json with
+# a snapshot of internal state.
+
+# Management server listening address (default 127.0.0.1)
+#address = 0.0.0.0
+
+# Management server port number (default -1, server is disabled)
+#port = 8989
+
+ManagementToken = xxx
  </pre>
  
  h2. Other services
  
-The following services also support health check.  Set @MangementToken@ in the respective yaml config file for each service.
+The following services also support monitoring.  Set @ManagementToken@ in the respective yaml config file for each service.
  
  * keepstore
  * keep-web
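Note on the @[Manage]@ section shown in this file: a small Python sketch of how those settings might be read and interpreted, assuming the file is INI syntax that Python's standard @configparser@ can parse. The function name and the returned dictionary are hypothetical; the defaults (address 127.0.0.1, port -1 meaning the server is disabled) are taken from the comments in the sample. Treating any negative port as disabled is an assumption of this sketch.

```python
import configparser

# Sample mirroring the documented [Manage] section, with the optional
# settings uncommented for illustration.
SAMPLE = """
[Manage]
address = 0.0.0.0
port = 8989
ManagementToken = xxx
"""


def load_manage_settings(text: str) -> dict:
    """Parse the [Manage] section and apply the documented defaults."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    address = parser.get("Manage", "address", fallback="127.0.0.1")
    port = parser.getint("Manage", "port", fallback=-1)
    # configparser lowercases option names, so the lookup is case-insensitive.
    token = parser.get("Manage", "ManagementToken", fallback=None)
    return {
        "address": address,
        "port": port,
        "token": token,
        # Assumption: a negative port means the management server is off.
        "enabled": port >= 0,
    }
```

With the commented-out defaults left in place (as in the documentation sample), the sketch reports the management server as disabled, which matches the "default -1, server is disabled" comment.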
diff --git a/doc/admin/metrics.html.textile.liquid b/doc/admin/metrics.html.textile.liquid

index 107431267e75f12f71eeecc1b875431bbf84d222..e41a96ffc48413fd08a99b0ce516994fcf81be47 100644 (file)
--- a/doc/admin/metrics.html.textile.liquid
+++ b/doc/admin/metrics.html.textile.liquid
@@ -12,7 +12,7 @@ SPDX-License-Identifier: CC-BY-SA-3.0
  
  Metrics endpoints are found at @/status.json@ on many Arvados services.  The purpose of metrics are to provide statistics about the operation of a service, suitable for diagnosing how well a service is performing under load.
  
-Metrics endpoints must be configured with a "management token":management-token.html .
+To access metrics endpoints, services must be configured with a "management token":management-token.html .
  
  h2. Keepstore
  
@@ -73,6 +73,53 @@ table(table table-bordered table-condensed).
  |InProgress| int||
  |Queued|     int||
  
+h3. Example response
+
+<pre>
+{
+  "Volumes": [
+    {
+      "Label": "[UnixVolume /var/lib/arvados/keep0]",
+      "Status": {
+        "MountPoint": "/var/lib/arvados/keep0",
+        "DeviceNum": 65029,
+        "BytesFree": 222532972544,
+        "BytesUsed": 435456679936
+      },
+      "InternalStats": {
+        "Errors": 0,
+        "InBytes": 1111,
+        "OutBytes": 0,
+        "OpenOps": 1,
+        "StatOps": 4,
+        "FlockOps": 0,
+        "UtimesOps": 0,
+        "CreateOps": 0,
+        "RenameOps": 0,
+        "UnlinkOps": 0,
+        "ReaddirOps": 0
+      }
+    }
+  ],
+  "BufferPool": {
+    "BytesAllocatedCumulative": 67108864,
+    "BuffersMax": 20,
+    "BuffersInUse": 0
+  },
+  "PullQueue": {
+    "InProgress": 0,
+    "Queued": 0
+  },
+  "TrashQueue": {
+    "InProgress": 0,
+    "Queued": 0
+  },
+  "RequestsCurrent": 1,
+  "RequestsMax": 40,
+  "Version": "dev"
+}
+</pre>
+
  h2. Node manager
  
  The node manager status end point provides a snapshot of internal status at the time of the most recent wishlist update.
@@ -89,3 +136,28 @@ table(table table-bordered table-condensed).
  |nodes_wish|int|Number of nodes in the current wishlist|
  |node_quota|int|Current node count ceiling due to cloud quota limits|
  |config_max_nodes|int|Configured max node count|
+
+h3. Example response
+
+<pre>
+{
+  "actor_exceptions": 0,
+  "idle_times": {
+    "compute1": 0,
+    "compute3": 0,
+    "compute2": 0,
+    "compute4": 0
+  },
+  "create_node_errors": 0,
+  "destroy_node_errors": 0,
+  "nodes_idle": 0,
+  "config_max_nodes": 8,
+  "list_nodes_errors": 0,
+  "node_quota": 8,
+  "Version": "1.1.4.20180719160944",
+  "nodes_wish": 0,
+  "nodes_unpaired": 0,
+  "nodes_busy": 4,
+  "boot_failures": 0
+}
+</pre>
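Note on the Keepstore example response added in this file: a short Python sketch of how a monitoring script might derive load indicators from a @/status.json@ document. The field names come from the example response; the function names are hypothetical, and treating @BytesUsed + BytesFree@ as the volume's capacity is an assumption of this sketch rather than something the documentation states.

```python
import json


def volume_usage(status: dict) -> dict:
    """Map each volume label to the fraction of its space that is in use."""
    usage = {}
    for volume in status.get("Volumes", []):
        stats = volume.get("Status", {})
        used = stats.get("BytesUsed", 0)
        free = stats.get("BytesFree", 0)
        total = used + free  # assumed to approximate total capacity
        usage[volume.get("Label", "?")] = used / total if total else 0.0
    return usage


def buffer_utilization(status: dict) -> float:
    """Fraction of the buffer pool currently in use."""
    pool = status.get("BufferPool", {})
    buffers_max = pool.get("BuffersMax", 0)
    return pool.get("BuffersInUse", 0) / buffers_max if buffers_max else 0.0


# Trimmed copy of the example response, keeping only the fields used here.
SAMPLE = json.loads("""
{
  "Volumes": [
    {
      "Label": "[UnixVolume /var/lib/arvados/keep0]",
      "Status": {
        "MountPoint": "/var/lib/arvados/keep0",
        "BytesFree": 222532972544,
        "BytesUsed": 435456679936
      }
    }
  ],
  "BufferPool": {
    "BytesAllocatedCumulative": 67108864,
    "BuffersMax": 20,
    "BuffersInUse": 0
  }
}
""")

if __name__ == "__main__":
    for label, fraction in volume_usage(SAMPLE).items():
        print("%s: %.1f%% used" % (label, 100 * fraction))
    print("buffer pool: %.1f%% in use" % (100 * buffer_utilization(SAMPLE)))
```

For the sample values the volume is roughly two thirds full and the buffer pool is idle. Both helpers return 0.0 rather than raising when a denominator is zero or a field is missing, so a partially populated response does not break the caller.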
diff --git a/services/nodemanager/tests/fake_azure.cfg.template b/services/nodemanager/tests/fake_azure.cfg.template

index a11a6d807ef9348d9a17deac9e0c2092ed929f46..e5deac85d257057292466eadf6fae1e7c5edb8c3 100644 (file)
--- a/services/nodemanager/tests/fake_azure.cfg.template
+++ b/services/nodemanager/tests/fake_azure.cfg.template
@@ -10,10 +10,12 @@
  # a snapshot of internal state.
  
  # Management server listening address (default 127.0.0.1)
-#address = 0.0.0.0
+address = 0.0.0.0
  
  # Management server port number (default -1, server is disabled)
-#port = 8989
+port = 8989
+
+ManagementToken = xxx
  
  [Daemon]
  # The dispatcher can customize the start and stop procedure for