Refs #18797. Closes #19788.
Arvados-DCO-1.1-Signed-off-by: Brett Smith <brett.smith@curii.com>
- admin/logging.html.textile.liquid
- admin/metrics.html.textile.liquid
- admin/health-checks.html.textile.liquid
+ - admin/diagnostics.html.textile.liquid
- admin/management-token.html.textile.liquid
- admin/user-activity.html.textile.liquid
- Data Management:
--- /dev/null
+---
+layout: default
+navsection: admin
+title: Diagnostics
+...
+
+{% comment %}
+Copyright (C) The Arvados Authors. All rights reserved.
+
+SPDX-License-Identifier: CC-BY-SA-3.0
+{% endcomment %}
+
+The @arvados-client diagnostics@ command exercises basic cluster functionality, and identifies some common installation and configuration problems. Especially after upgrading or reconfiguring Arvados or server/network infrastructure, it can be the quickest way to identify problems.
+
+h2. Using system privileges
+
+On a server node, it is easiest to run the diagnostics command with system privileges. The word @sudo@ here instructs the @arvados-client@ command to load @Controller.ExternalURL@ and @SystemRootToken@ from @/etc/arvados/config.yml@ and use those credentials to run tests with system privileges.
+
+When run this way, diagnostics will also include "health checks":health-checks.html.
+
+<notextile><pre>
+# <span class="userinput">arvados-client sudo diagnostics</span>
+</pre></notextile>
+
+h2. Using regular user privileges
+
+On any node (server node, shell node, or a workstation outside the system network), you can also run diagnostics by setting the usual @ARVADOS_API_HOST@ and @ARVADOS_API_TOKEN@ environment variables. Typically this is done with a regular user account.
+
+<notextile><pre>
+$ <span class="userinput">export ARVADOS_API_HOST=zzzzz.arvadosapi.com</span>
+$ <span class="userinput">export ARVADOS_API_TOKEN=xxxxxxxxxx</span>
+$ <span class="userinput">arvados-client diagnostics</span>
+</pre></notextile>
+
+h2. Internal/external client detection
+
+The diagnostics output indicates whether its client connection is categorized by the server as internal or external. If you run diagnostics automatically with cron or a monitoring tool, you can use the @-internal-client@ or @-external-client@ flag to specify how you _expect_ the client to be categorized, and the test will fail otherwise. Example:
+
+<notextile><pre>
+# <span class="userinput">arvados-client sudo diagnostics -internal-client</span>
+[...]
+
+--- cut here --- error summary ---
+
+ERROR 60: checking internal/external client detection (11 ms): expecting internal=true external=false, but found internal=false external=true
+</pre></notextile>
+
+h2. Example output
+
+<notextile><pre>
+# <span class="userinput">arvados-client sudo diagnostics</span>
+INFO 5: running health check (same as `arvados-server check`)
+INFO 10: getting discovery document from https://zzzzz.arvadosapi.com/discovery/v1/apis/arvados/v1/rest
+INFO 20: getting exported config from https://zzzzz.arvadosapi.com/arvados/v1/config
+INFO 30: getting current user record
+INFO 40: connecting to service endpoint https://keep.zzzzz.arvadosapi.com/
+INFO 41: connecting to service endpoint https://*.collections.zzzzz.arvadosapi.com/
+INFO 42: connecting to service endpoint https://download.zzzzz.arvadosapi.com/
+INFO 43: connecting to service endpoint wss://ws.zzzzz.arvadosapi.com/websocket
+INFO 44: connecting to service endpoint https://workbench.zzzzz.arvadosapi.com/
+INFO 45: connecting to service endpoint https://workbench2.zzzzz.arvadosapi.com/
+INFO 50: checking CORS headers at https://zzzzz.arvadosapi.com/
+INFO 51: checking CORS headers at https://keep.zzzzz.arvadosapi.com/d41d8cd98f00b204e9800998ecf8427e+0
+INFO 52: checking CORS headers at https://download.zzzzz.arvadosapi.com/
+INFO 60: checking internal/external client detection
+INFO 61: reading+writing via keep service at https://keep.zzzzz.arvadosapi.com:443/
+INFO 80: finding/creating "scratch area for diagnostics" project
+INFO 90: creating temporary collection
+INFO 100: uploading file via webdav
+INFO 110: checking WebDAV ExternalURL wildcard (https://*.collections.zzzzz.arvadosapi.com/)
+INFO 120: downloading from webdav (https://d41d8cd98f00b204e9800998ecf8427e-0.collections.zzzzz.arvadosapi.com/foo)
+INFO 121: downloading from webdav (https://d41d8cd98f00b204e9800998ecf8427e-0.collections.zzzzz.arvadosapi.com/sha256:feb5d9fea6a5e9606aa995e879d862b825965ba48de054caab5ef356dc6b3412.tar)
+INFO 122: downloading from webdav (https://download.zzzzz.arvadosapi.com/c=d41d8cd98f00b204e9800998ecf8427e+0/_/foo)
+INFO 123: downloading from webdav (https://download.zzzzz.arvadosapi.com/c=d41d8cd98f00b204e9800998ecf8427e+0/_/sha256:feb5d9fea6a5e9606aa995e879d862b825965ba48de054caab5ef356dc6b3412.tar)
+INFO 124: downloading from webdav (https://a15a27cbc1c7d2d4a0d9e02529aaec7e-128.collections.zzzzz.arvadosapi.com/sha256:feb5d9fea6a5e9606aa995e879d862b825965ba48de054caab5ef356dc6b3412.tar)
+INFO 125: downloading from webdav (https://download.zzzzz.arvadosapi.com/c=zzzzz-4zz18-twitqma8mbvwydy/_/sha256:feb5d9fea6a5e9606aa995e879d862b825965ba48de054caab5ef356dc6b3412.tar)
+INFO 130: getting list of virtual machines
+INFO 140: getting workbench1 webshell page
+INFO 150: connecting to webshell service
+INFO 160: running a container
+INFO ... container request submitted, waiting up to 10m for container to run
+INFO 9990: deleting temporary collection
+</pre></notextile>
}
</pre>
-h2. Healthcheck aggregator
+h2. Health check aggregator
The service @arvados-health@ performs health checks on all configured services and returns a single value of @OK@ or @ERROR@ for the entire cluster. It exposes the endpoint @/_health/all@ .
The healthcheck aggregator uses the @Services@ section of the cluster-wide @config.yml@ configuration file.
+
+h2. Health check command
+
+The @arvados-server check@ command is another way to perform the same health checks as the health check aggregator service. It does not depend on the aggregator service.
+
+If all checks pass, it writes @health check OK@ to stderr (unless the @-quiet@ flag is used) and exits 0. Otherwise, it writes error messages to stderr and exits with error status.
+
+@arvados-server check -yaml@ outputs a YAML document on stdout with additional details about each service endpoint that was checked.
+
+{% codeblock as yaml %}
+Checks:
+ "arvados-api-server+http://localhost:8004/_health/ping":
+ ClockTime: "2022-11-16T16:08:57Z"
+ ConfigSourceSHA256: e2c086ae3dd290cf029cb3fe79146529622279b6280cf6cd17dc8d8c30daa57f
+ ConfigSourceTimestamp: "2022-11-07T18:08:24.539545Z"
+ HTTPStatusCode: 200
+ Health: OK
+ Response:
+ health: OK
+ ResponseTime: 0.017159
+ Server: nginx/1.14.0 + Phusion Passenger(R) 6.0.15
+ Version: 2.5.0~dev20221116141533
+ "arvados-controller+http://localhost:8003/_health/ping":
+ ClockTime: "2022-11-16T16:08:57Z"
+ ConfigSourceSHA256: e2c086ae3dd290cf029cb3fe79146529622279b6280cf6cd17dc8d8c30daa57f
+ ConfigSourceTimestamp: "2022-11-07T18:08:24.539545Z"
+ HTTPStatusCode: 200
+ Health: OK
+ Response:
+ health: OK
+ ResponseTime: 0.004748
+ Server: ""
+ Version: 2.5.0~dev20221116141533 (go1.18.8)
+# ...
+{% endcodeblock %}
h2(#main). development main (as of 2022-10-31)
-"previous: Upgrading to 2.4.3":#v2_4_3
+"previous: Upgrading to 2.4.4":#v2_4_4
h3. Google or OpenID Connect login restricted to trusted clients
h3. New keepstore S3 driver enabled by default
A more actively maintained S3 client library is now enabled by default for keeepstore services. The previous driver is still available for use in case of unknown issues. To use the old driver, set @DriverParameters.UseAWSS3v2Driver@ to @false@ on the appropriate @Volumes@ config entries.
-h2(#main). development main (as of 2022-10-14)
h3. Old container logs are automatically deleted from PostgreSQL
The config entries @Collections.WebDAVCache.UUIDTTL@, @...MaxCollectionEntries@, and @...MaxUUIDEntries@ are no longer used, and should be removed from your config file.
+h2(#v2_4_4). v2.4.4 (2022-11-18)
+
+"previous: Upgrading to 2.4.3":#v2_4_3
+
+This update only consists of improvements to @arvados-cwl-runner@. There are no changes to backend services.
+
h2(#v2_4_3). v2.4.3 (2022-09-21)
"previous: Upgrading to 2.4.2":#v2_4_2
isInternal := found["proxy"] == 0 && len(keeplist.Items) > 0
isExternal := found["proxy"] > 0 && found["proxy"] == len(keeplist.Items)
if isExternal {
- diag.verbosef("controller returned only proxy services, this host is treated as \"external\"")
+ diag.infof("controller returned only proxy services, this host is treated as \"external\"")
} else if isInternal {
- diag.verbosef("controller returned only non-proxy services, this host is treated as \"internal\"")
+ diag.infof("controller returned only non-proxy services, this host is treated as \"internal\"")
}
if (diag.checkInternal && !isInternal) || (diag.checkExternal && !isExternal) {
return fmt.Errorf("expecting internal=%v external=%v, but found internal=%v external=%v", diag.checkInternal, diag.checkExternal, isInternal, isExternal)
timeout := 10 * time.Minute
diag.infof("container request submitted, waiting up to %v for container to run", arvados.Duration(timeout))
- ctx, cancel = context.WithDeadline(context.Background(), time.Now().Add(timeout))
- defer cancel()
+ deadline := time.Now().Add(timeout)
var c arvados.Container
- for ; cr.State != arvados.ContainerRequestStateFinal; time.Sleep(2 * time.Second) {
- ctx, cancel := context.WithDeadline(ctx, time.Now().Add(diag.timeout))
+ for ; cr.State != arvados.ContainerRequestStateFinal && time.Now().Before(deadline); time.Sleep(2 * time.Second) {
+ ctx, cancel := context.WithDeadline(context.Background(), time.Now().Add(diag.timeout))
defer cancel()
crStateWas := cr.State
if c.State != cStateWas {
diag.debugf("container state = %s", c.State)
}
+
+ cancel()
}
+ if cr.State != arvados.ContainerRequestStateFinal {
+ err := client.RequestAndDecodeContext(context.Background(), &cr, "PATCH", "arvados/v1/container_requests/"+cr.UUID, nil, map[string]interface{}{
+ "container_request": map[string]interface{}{
+ "priority": 0,
+ }})
+ if err != nil {
+ diag.infof("error canceling container request %s: %s", cr.UUID, err)
+ } else {
+ diag.debugf("canceled container request %s", cr.UUID)
+ }
+ return fmt.Errorf("timed out waiting for container to finish; container request %s state was %q, container %s state was %q", cr.UUID, cr.State, c.UUID, c.State)
+ }
if c.State != arvados.ContainerStateComplete {
return fmt.Errorf("container request %s is final but container %s did not complete: container state = %q", cr.UUID, cr.ContainerUUID, c.State)
- } else if c.ExitCode != 0 {
+ }
+ if c.ExitCode != 0 {
return fmt.Errorf("container exited %d", c.ExitCode)
}
return nil
versionFlag := flags.Bool("version", false, "Write version information to stdout and exit 0")
timeout := flags.Duration("timeout", defaultTimeout.Duration(), "Maximum time to wait for health responses")
quiet := flags.Bool("quiet", false, "Silent on success (suppress 'health check OK' message on stderr)")
- outputYAML := flags.Bool("yaml", false, "Output full health report in YAML format (default mode shows errors as plain text, is silent on success)")
+ outputYAML := flags.Bool("yaml", false, "Output full health report in YAML format (default mode prints 'health check OK' or plain text errors)")
if ok, _ := cmd.ParseFlags(flags, prog, args, "", stderr); !ok {
// cmd.ParseFlags already reported the error
return errSilent
conn.exec_query 'SAVEPOINT save_with_unique_name'
begin
save!
+ conn.exec_query 'RELEASE SAVEPOINT save_with_unique_name'
rescue ActiveRecord::RecordNotUnique => rn
raise if max_retries == 0
max_retries -= 1
- conn.exec_query 'ROLLBACK TO SAVEPOINT save_with_unique_name'
-
# Dig into the error to determine if it is specifically calling out a
# (owner_uuid, name) uniqueness violation. In this specific case, and
# the client requested a unique name with ensure_unique_name==true,
detail = err.result.error_field(PG::Result::PG_DIAG_MESSAGE_DETAIL)
raise unless /^Key \(owner_uuid, name\)=\([a-z0-9]{5}-[a-z0-9]{5}-[a-z0-9]{15}, .*?\) already exists\./.match detail
+ conn.exec_query 'ROLLBACK TO SAVEPOINT save_with_unique_name'
+
new_name = "#{name_was} (#{db_current_time.utc.iso8601(3)})"
if new_name == name
# If the database is fast enough to do two attempts in the
self[:current_version_uuid] = nil
end
end
- conn.exec_query 'SAVEPOINT save_with_unique_name'
+
retry
- ensure
- conn.exec_query 'RELEASE SAVEPOINT save_with_unique_name'
end
end
end
"Expected 'duplicate key' error in #{response_errors.first}")
end
+ [false, true].each do |ensure_unique_name|
+ test "create failure with duplicate name, ensure_unique_name #{ensure_unique_name}" do
+ authorize_with :active
+ post :create, params: {
+ collection: {
+ owner_uuid: users(:active).uuid,
+ manifest_text: "",
+ name: "this...............................................................................................................................................................................................................................................................name is too long"
+ },
+ ensure_unique_name: ensure_unique_name
+ }
+ assert_response 422
+ # check the real error isn't masked by an
+ # ensure_unique_name-related error (#19698)
+ assert_match /value too long for type/, json_response['errors'][0]
+ end
+ end
+
[false, true].each do |unsigned|
test "create with duplicate name, ensure_unique_name, unsigned=#{unsigned}" do
permit_unsigned_manifests unsigned