admin:
- Topics:
- admin/index.html.textile.liquid
+ - Upgrading and migrations:
- admin/upgrading.html.textile.liquid
+ - install/migrate-docker19.html.textile.liquid
+ - Users and Groups:
- install/cheat_sheet.html.textile.liquid
- - user/topics/arvados-sync-groups.html.textile.liquid
- - admin/storage-classes.html.textile.liquid
- admin/activation.html.textile.liquid
- - admin/migrating-providers.html.textile.liquid
- admin/merge-remote-account.html.textile.liquid
+ - admin/migrating-providers.html.textile.liquid
+ - user/topics/arvados-sync-groups.html.textile.liquid
+ - Monitoring:
+ - admin/health-checks.html.textile.liquid
+ - admin/metrics.html.textile.liquid
+ - admin/management-token.html.textile.liquid
+ - Cloud:
+ - admin/storage-classes.html.textile.liquid
- admin/spot-instances.html.textile.liquid
- - install/migrate-docker19.html.textile.liquid
installguide:
- Overview:
- install/index.html.textile.liquid
--- /dev/null
+---
+layout: default
+navsection: admin
+title: Health checks
+...
+
+{% comment %}
+Copyright (C) The Arvados Authors. All rights reserved.
+
+SPDX-License-Identifier: CC-BY-SA-3.0
+{% endcomment %}
+
+Health check endpoints are found at @/_health/ping@ on many Arvados services. The purpose of the health check is to offer a simple method of determining whether a service can be reached and to allow the service to self-report any problems, suitable for integration into operational alert systems.
+
+To access health check endpoints, services must be configured with a "management token":management-token.html .
+
+Health check endpoints return a JSON object with the field @health@. This has a value of either @OK@ or @ERROR@. On error, it may also include a field @error@ with additional information. Examples:
+
+<pre>
+{
+ "health": "OK"
+}
+</pre>
+
+<pre>
+{
+ "health": "ERROR"
+ "error": "Inverted polarity in the warp core"
+}
+</pre>
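+
+For example, a hypothetical query of a service's health check endpoint (the hostname and token shown are placeholders; substitute your own service address and configured management token):
+
+<pre>
+$ curl -H "Authorization: Bearer xxx" https://zzzzz.example.com/_health/ping
+{
+  "health": "OK"
+}
+</pre>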
+
+h2. Healthcheck aggregator
+
+The service @arvados-health@ performs health checks on all configured services and returns a single value of @OK@ or @ERROR@ for the entire cluster. It exposes the endpoint @/_health/all@ .
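+
+A minimal sketch of querying the aggregator (the listen address, port, and token here are placeholders for your own @arvados-health@ configuration):
+
+<pre>
+$ curl -H "Authorization: Bearer xxx" http://localhost:9440/_health/all
+{
+  "health": "OK"
+}
+</pre>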
+
+The healthcheck aggregator uses the @NodeProfile@ section of the cluster-wide @arvados.yml@ configuration file. Here is an example.
+
+<pre>
+Cluster:
+ # The cluster uuid prefix
+ zzzzz:
+ NodeProfile:
+ # For each node, the profile name corresponds to a
+ # locally-resolvable hostname, and describes which Arvados
+ # services are available on that machine.
+ api:
+ arvados-controller:
+ Listen: 8000
+ arvados-api-server:
+ Listen: 8001
+ manage:
+ arvados-node-manager:
+ Listen: 8002
+ workbench:
+ arvados-workbench:
+ Listen: 8003
+ arvados-ws:
+ Listen: 8004
+ keep:
+ keep-web:
+ Listen: 8005
+ keepproxy:
+ Listen: 8006
+ keep0:
+ keepstore:
+ Listen: 25701
+ keep1:
+ keepstore:
+ Listen: 25701
+</pre>
--- /dev/null
+---
+layout: default
+navsection: admin
+title: Management token
+...
+
+{% comment %}
+Copyright (C) The Arvados Authors. All rights reserved.
+
+SPDX-License-Identifier: CC-BY-SA-3.0
+{% endcomment %}
+
+To enable health checks and collect metrics, services must be configured with a "management token".
+
+Services must have @ManagementToken@ configured. This token is used to authorize access to monitoring endpoints. If @ManagementToken@ is not configured, monitoring endpoints return the error @404 disabled@.
+
+To access a monitoring endpoint, the requester must provide the HTTP header @Authorization: Bearer (ManagementToken)@.
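+
+For example, a hypothetical request to a service's health check endpoint (the host, port, and token are placeholders for your own deployment):
+
+<pre>
+$ curl -H "Authorization: Bearer xxx" http://localhost:8000/_health/ping
+</pre>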
+
+h2. API server
+
+Set @ManagementToken@ in the appropriate section of @application.yml@.
+
+<pre>
+production:
+ # Token to be included in all healthcheck requests. Disabled by default.
+ # Server expects request header of the format "Authorization: Bearer xxx"
+ ManagementToken: xxx
+</pre>
+
+h2. Node Manager
+
+Set @port@ (the listen port) and @ManagementToken@ in the @Manage@ section of @node-manager.ini@.
+
+<pre>
+[Manage]
+# The management server responds to http://addr:port/status.json with
+# a snapshot of internal state.
+
+# Management server listening address (default 127.0.0.1)
+#address = 0.0.0.0
+
+# Management server port number (default -1, server is disabled)
+#port = 8989
+
+ManagementToken = xxx
+</pre>
+
+h2. Other services
+
+The following services also support monitoring. Set @ManagementToken@ in the respective YAML config file for each service.
+
+* keepstore
+* keep-web
+* keepproxy
+* arv-git-httpd
+* websockets
--- /dev/null
+---
+layout: default
+navsection: admin
+title: Metrics
+...
+
+{% comment %}
+Copyright (C) The Arvados Authors. All rights reserved.
+
+SPDX-License-Identifier: CC-BY-SA-3.0
+{% endcomment %}
+
+Metrics endpoints are found at @/status.json@ on many Arvados services. The purpose of metrics is to provide statistics about the operation of a service, suitable for diagnosing how well it is performing under load.
+
+To access metrics endpoints, services must be configured with a "management token":management-token.html .
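+
+For example, a hypothetical request for keepstore metrics (the host, port, and token are placeholders for your own deployment):
+
+<pre>
+$ curl -H "Authorization: Bearer xxx" http://keep0.example.com:25107/status.json
+</pre>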
+
+h2. Keepstore
+
+h3. Root
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|Volumes| array of "volumeStatusEnt":#volumeStatusEnt ||
+|BufferPool| "PoolStatus":#PoolStatus ||
+|PullQueue| "WorkQueueStatus":#WorkQueueStatus ||
+|TrashQueue| "WorkQueueStatus":#WorkQueueStatus ||
+|RequestsCurrent| int ||
+|RequestsMax| int ||
+|Version| string ||
+
+h3(#volumeStatusEnt). volumeStatusEnt
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|Label| string||
+|Status| "VolumeStatus":#VolumeStatus ||
+|VolumeStats| "ioStats":#ioStats ||
+
+h3(#VolumeStatus). VolumeStatus
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|MountPoint| string||
+|DeviceNum| uint64||
+|BytesFree| uint64||
+|BytesUsed| uint64||
+
+h3(#ioStats). ioStats
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|Errors| uint64||
+|Ops| uint64||
+|CompareOps| uint64||
+|GetOps| uint64||
+|PutOps| uint64||
+|TouchOps| uint64||
+|InBytes| uint64||
+|OutBytes| uint64||
+
+h3(#PoolStatus). PoolStatus
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|BytesAllocatedCumulative| uint64||
+|BuffersMax| int||
+|BuffersInUse| int||
+
+h3(#WorkQueueStatus). WorkQueueStatus
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|InProgress| int||
+|Queued| int||
+
+h3. Example response
+
+<pre>
+{
+ "Volumes": [
+ {
+ "Label": "[UnixVolume /var/lib/arvados/keep0]",
+ "Status": {
+ "MountPoint": "/var/lib/arvados/keep0",
+ "DeviceNum": 65029,
+ "BytesFree": 222532972544,
+ "BytesUsed": 435456679936
+ },
+ "InternalStats": {
+ "Errors": 0,
+ "InBytes": 1111,
+ "OutBytes": 0,
+ "OpenOps": 1,
+ "StatOps": 4,
+ "FlockOps": 0,
+ "UtimesOps": 0,
+ "CreateOps": 0,
+ "RenameOps": 0,
+ "UnlinkOps": 0,
+ "ReaddirOps": 0
+ }
+ }
+ ],
+ "BufferPool": {
+ "BytesAllocatedCumulative": 67108864,
+ "BuffersMax": 20,
+ "BuffersInUse": 0
+ },
+ "PullQueue": {
+ "InProgress": 0,
+ "Queued": 0
+ },
+ "TrashQueue": {
+ "InProgress": 0,
+ "Queued": 0
+ },
+ "RequestsCurrent": 1,
+ "RequestsMax": 40,
+ "Version": "dev"
+}
+</pre>
+
+h2. Node manager
+
+The node manager status endpoint provides a snapshot of internal status at the time of the most recent wishlist update.
+
+table(table table-bordered table-condensed).
+|_. Attribute|_. Type|_. Description|
+|nodes_booting|int|Number of nodes in booting state|
+|nodes_unpaired|int|Number of nodes in unpaired state|
+|nodes_busy|int|Number of nodes in busy state|
+|nodes_idle|int|Number of nodes in idle state|
+|nodes_fail|int|Number of nodes in fail state|
+|nodes_down|int|Number of nodes in down state|
+|nodes_shutdown|int|Number of nodes in shutdown state|
+|nodes_wish|int|Number of nodes in the current wishlist|
+|node_quota|int|Current node count ceiling due to cloud quota limits|
+|config_max_nodes|int|Configured max node count|
+
+h3. Example
+
+<pre>
+{
+ "actor_exceptions": 0,
+ "idle_times": {
+ "compute1": 0,
+ "compute3": 0,
+ "compute2": 0,
+ "compute4": 0
+ },
+ "create_node_errors": 0,
+ "destroy_node_errors": 0,
+ "nodes_idle": 0,
+ "config_max_nodes": 8,
+ "list_nodes_errors": 0,
+ "node_quota": 8,
+ "Version": "1.1.4.20180719160944",
+ "nodes_wish": 0,
+ "nodes_unpaired": 0,
+ "nodes_busy": 4,
+ "boot_failures": 0
+}
+</pre>
!(full-width){{site.baseurl}}/images/Crunch_dispatch.svg!
+h2(#RAM). Understanding RAM requests for containers
+
+The @runtime_constraints@ section of a container specifies working RAM (@ram@) and Keep cache (@keep_cache_ram@). If @keep_cache_ram@ is not specified, the container gets a default Keep cache (@container_default_keep_cache_ram@, default 256 MiB). The total RAM requested for a container is the sum of working RAM, Keep cache, and an additional RAM reservation configured by the admin (@ReserveExtraRAM@ in the dispatcher configuration, default zero).
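+
+For reference, here is a minimal sketch of the relevant @runtime_constraints@ fields in a container request; the values are illustrative only (both fields are byte counts, corresponding to 3 GiB of working RAM and a 256 MiB Keep cache):
+
+<pre>
+"runtime_constraints": {
+  "ram": 3221225472,
+  "keep_cache_ram": 268435456
+}
+</pre>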
+
+The total RAM request is used to schedule containers onto compute nodes. RAM allocation limits are enforced using kernel controls such as cgroups. A container which requests 1 GiB RAM will only be permitted to allocate up to 1 GiB of RAM, even if scheduled on a 4 GiB node. On HPC systems, a multi-core node may run multiple containers at a time.
+
+When running on the cloud, the memory request (along with CPU and disk) is used to select (and possibly boot) an instance type with adequate resources to run the container. Instance type RAM is derated 5% from the published specification to accommodate virtual machine, kernel, and system service overhead.
+
+h3. Calculate minimum instance type RAM for a container
+
+ (RAM request + Keep cache + ReserveExtraRAM) * (100/95)
+
+For example, for a 3 GiB request, default Keep cache, and no extra RAM reserved:
+
+ (3072 + 256) * 1.0526 = 3503 MiB
+
+To run this container, the instance type must have a published RAM size of at least 3503 MiB.
+
+h3. Calculate the maximum requestable RAM for an instance type
+
+ (Instance type RAM * (95/100)) - Keep cache - ReserveExtraRAM
+
+For example, for a 3.75 GiB node, default Keep cache, and no extra RAM reserved:
+
+ (3840 * 0.95) - 256 = 3392 MiB
+
+To run on this instance type, the container can request at most 3392 MiB of working RAM.
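+
+Both calculations can be double-checked with shell arithmetic, as in this minimal sketch (integer division, so the results may differ from exact values by a mebibyte):
+
+<pre>
+$ echo $(( (3072 + 256) * 100 / 95 ))    # minimum published instance RAM for a 3 GiB request
+3503
+$ echo $(( 3840 * 95 / 100 - 256 ))      # maximum working RAM request on a 3.75 GiB instance
+3392
+</pre>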
+
h2. Job API (deprecated)
# To submit work, create a "job":{{site.baseurl}}/api/methods/jobs.html . If the same job has been submitted in the past, it will return an existing job in @Completed@ state.
---
layout: default
navsection: admin
-title: User management
+title: User management at the CLI
...
{% comment %}
Copyright (C) The Arvados Authors. All rights reserved.
tmpdirMin: 90000
</pre>
+* Available compute node types vary over time and across different cloud providers, so try to limit the RAM requirement to what the program actually needs. However, if you need to target a specific compute node type, see this discussion on "calculating RAM request and choosing instance type for containers.":{{site.baseurl}}/api/execution.html#RAM
+
* Instead of scattering separate steps, prefer to scatter over a subworkflow.
With the following pattern, @step1@ has to wait for all samples to complete before @step2@ can start computing on any samples. This means a single long-running sample can prevent the rest of the workflow from moving on:
"os"
"strconv"
"strings"
+ "syscall"
"time"
)
// Interval between samples. Must be positive.
PollPeriod time.Duration
+	// Temporary directory; will be monitored for available, used, and total space.
+ TempDir string
+
// Where to write statistics. Must not be nil.
Logger *log.Logger
- reportedStatFile map[string]string
- lastNetSample map[string]ioSample
- lastDiskSample map[string]ioSample
- lastCPUSample cpuSample
+ reportedStatFile map[string]string
+ lastNetSample map[string]ioSample
+ lastDiskIOSample map[string]ioSample
+ lastCPUSample cpuSample
+ lastDiskSpaceSample diskSpaceSample
done chan struct{} // closed when we should stop reporting
flushed chan struct{} // closed when we have made our last report
continue
}
delta := ""
- if prev, ok := r.lastDiskSample[dev]; ok {
+ if prev, ok := r.lastDiskIOSample[dev]; ok {
delta = fmt.Sprintf(" -- interval %.4f seconds %d write %d read",
sample.sampleTime.Sub(prev.sampleTime).Seconds(),
sample.txBytes-prev.txBytes,
sample.rxBytes-prev.rxBytes)
}
r.Logger.Printf("blkio:%s %d write %d read%s\n", dev, sample.txBytes, sample.rxBytes, delta)
- r.lastDiskSample[dev] = sample
+ r.lastDiskIOSample[dev] = sample
}
}
}
}
+type diskSpaceSample struct {
+ hasData bool
+ sampleTime time.Time
+ total uint64
+ used uint64
+ available uint64
+}
+
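+// doDiskSpaceStats reports available, used, and total bytes for the
+// filesystem containing r.TempDir, along with the change in used space
+// since the previous sample.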
+func (r *Reporter) doDiskSpaceStats() {
+ s := syscall.Statfs_t{}
+ err := syscall.Statfs(r.TempDir, &s)
+ if err != nil {
+ return
+ }
+ bs := uint64(s.Bsize)
+ nextSample := diskSpaceSample{
+ hasData: true,
+ sampleTime: time.Now(),
+ total: s.Blocks * bs,
+ used: (s.Blocks - s.Bfree) * bs,
+ available: s.Bavail * bs,
+ }
+
+ var delta string
+ if r.lastDiskSpaceSample.hasData {
+ prev := r.lastDiskSpaceSample
+ interval := nextSample.sampleTime.Sub(prev.sampleTime).Seconds()
+ delta = fmt.Sprintf(" -- interval %.4f seconds %d used",
+ interval,
+ int64(nextSample.used-prev.used))
+ }
+ r.Logger.Printf("statfs %d available %d used %d total%s\n",
+ nextSample.available, nextSample.used, nextSample.total, delta)
+ r.lastDiskSpaceSample = nextSample
+}
+
type cpuSample struct {
hasData bool // to distinguish the zero value from real data
sampleTime time.Time
}
r.lastNetSample = make(map[string]ioSample)
- r.lastDiskSample = make(map[string]ioSample)
+ r.lastDiskIOSample = make(map[string]ioSample)
+
+ if len(r.TempDir) == 0 {
+ // Temporary dir not provided, try to get it from the environment.
+ r.TempDir = os.Getenv("TMPDIR")
+ }
+ if len(r.TempDir) > 0 {
+ r.Logger.Printf("notice: monitoring temp dir %s\n", r.TempDir)
+ }
ticker := time.NewTicker(r.PollPeriod)
for {
r.doCPUStats()
r.doBlkIOStats()
r.doNetworkStats()
+ r.doDiskSpaceStats()
select {
case <-r.done:
return
my $cgroup_root = "/sys/fs/cgroup";
my $docker_bin = "docker.io";
my $docker_run_args = "";
+my $srun_sync_timeout = 15*60;
GetOptions('force-unlock' => \$force_unlock,
'git-dir=s' => \$git_dir,
'job=s' => \$jobspec,
'cgroup-root=s' => \$cgroup_root,
'docker-bin=s' => \$docker_bin,
'docker-run-args=s' => \$docker_run_args,
+ 'srun-sync-timeout=i' => \$srun_sync_timeout,
);
if (defined $job_api_token) {
my ($stdout_r, $stdout_w);
pipe $stdout_r, $stdout_w or croak("pipe() failed: $!");
+ my $started_srun = scalar time;
+
my $srunpid = fork();
if ($srunpid == 0)
{
if (!$busy) {
select(undef, undef, undef, 0.1);
}
+ if (($started_srun + $srun_sync_timeout) < scalar time) {
+ # Exceeded general timeout for "srun_sync" operations, likely
+ # means something got stuck on the remote node.
+ Log(undef, "srun_sync exceeded timeout, will fail.");
+ $main::please_freeze = 1;
+ }
killem(keys %proc) if $main::please_freeze;
}
my $exited = $?;
gem 'mocha', require: false
end
+# We need this dependency because of crunchv1
+gem 'arvados-cli'
+
# We'll need to update related code prior to Rails 5.
# See: https://github.com/rails/activerecord-deprecated_finders
gem 'activerecord-deprecated_finders', require: 'active_record/deprecated_finders'
i18n (~> 0)
json (>= 1.7.7, < 3)
jwt (>= 0.1.5, < 2)
+ arvados-cli (1.1.4.20180723133344)
+ activesupport (>= 3.2.13, < 5)
+ andand (~> 1.3, >= 1.3.3)
+ arvados (~> 0.1, >= 0.1.20150128223554)
+ curb (~> 0.8)
+ google-api-client (~> 0.6, >= 0.6.3, < 0.8.9)
+ json (>= 1.7.7, < 3)
+ oj (~> 3.0)
+ trollop (~> 2.0)
autoparse (0.3.3)
addressable (>= 2.3.1)
extlib (>= 0.9.15)
coffee-script-source (1.12.2)
concurrent-ruby (1.0.5)
crass (1.0.4)
+ curb (0.9.6)
database_cleaner (1.7.0)
erubis (2.7.0)
eventmachine (1.2.6)
acts_as_api
andand
arvados (>= 0.1.20150615153458)
+ arvados-cli
coffee-rails (~> 4.0)
database_cleaner
factory_girl_rails
uglifier (~> 2.0)
BUNDLED WITH
- 1.16.2
+ 1.16.3
INSERT INTO schema_migrations (version) VALUES ('20180514135529');
+INSERT INTO schema_migrations (version) VALUES ('20180607175050');
+
INSERT INTO schema_migrations (version) VALUES ('20180608123145');
-INSERT INTO schema_migrations (version) VALUES ('20180607175050');
@docker_bin = ENV['CRUNCH_JOB_DOCKER_BIN']
@docker_run_args = ENV['CRUNCH_JOB_DOCKER_RUN_ARGS']
@cgroup_root = ENV['CRUNCH_CGROUP_ROOT']
+ @srun_sync_timeout = ENV['CRUNCH_SRUN_SYNC_TIMEOUT']
@arvados_internal = Rails.configuration.git_internal_dir
if not File.exist? @arvados_internal
cmd_args += ['--docker-run-args', @docker_run_args]
end
+ if @srun_sync_timeout
+ cmd_args += ['--srun-sync-timeout', @srun_sync_timeout]
+ end
+
if have_job_lock?(job)
cmd_args << "--force-unlock"
end
CgroupParent: runner.expectCgroupParent,
CgroupRoot: runner.cgroupRoot,
PollPeriod: runner.statInterval,
+ TempDir: runner.parentTemp,
}
runner.statReporter.Start()
return nil
}
// Funnel stderr through our channel
- stderr_pipe, err := cmd.StderrPipe()
+ stderrPipe, err := cmd.StderrPipe()
if err != nil {
logger.Fatalln("error in StderrPipe:", err)
}
os.Stdin.Close()
os.Stdout.Close()
- copyPipeToChildLog(stderr_pipe, log.New(os.Stderr, "", 0))
+ copyPipeToChildLog(stderrPipe, log.New(os.Stderr, "", 0))
return cmd.Wait()
}
# a snapshot of internal state.
# Management server listening address (default 127.0.0.1)
-#address = 0.0.0.0
+address = 0.0.0.0
# Management server port number (default -1, server is disabled)
-#port = 8989
+port = 8989
+
+ManagementToken = xxx
[Daemon]
# The dispatcher can customize the start and stop procedure for
fi
blob_signing_key=$(cat /var/lib/arvados/blob_signing_key)
+if ! test -s /var/lib/arvados/management_token ; then
+ ruby -e 'puts rand(2**400).to_s(36)' > /var/lib/arvados/management_token
+fi
+management_token=$(cat /var/lib/arvados/management_token)
+
# self signed key will be created by SSO server script.
test -s /var/lib/arvados/self-signed.key
default_collection_replication: 1
docker_image_formats: ["v2"]
keep_web_service_url: http://$localip:${services[keep-web]}/
+ ManagementToken: $management_token
EOF
(cd config && /usr/local/lib/arvbox/yml_override.py application.yml)
echo $UUID > /var/lib/arvados/$1-uuid
fi
+management_token=$(cat /var/lib/arvados/management_token)
+
set +e
killall -HUP keepproxy
-exec /usr/local/bin/keepstore \
- -listen=:$2 \
- -enforce-permissions=true \
- -blob-signing-key-file=/var/lib/arvados/blob_signing_key \
- -data-manager-token-file=/var/lib/arvados/superuser_token \
- -max-buffers=20 \
- -volume=/var/lib/arvados/$1
+cat >/var/lib/arvados/$1.yml <<EOF
+Listen: ":$2"
+BlobSigningKeyFile: /var/lib/arvados/blob_signing_key
+SystemAuthTokenFile: /var/lib/arvados/superuser_token
+ManagementToken: $management_token
+MaxBuffers: 20
+Volumes:
+ - Type: Directory
+ Root: /var/lib/arvados/$1
+EOF
+
+exec /usr/local/bin/keepstore -config=/var/lib/arvados/$1.yml
+++ /dev/null
-/usr/local/lib/arvbox/runsu.sh
\ No newline at end of file
--- /dev/null
+#!/bin/sh
+# Copyright (C) The Arvados Authors. All rights reserved.
+#
+# SPDX-License-Identifier: AGPL-3.0
+
+set -e
+
+/usr/local/lib/arvbox/runsu.sh $0-service
+sv stop doc
cd /usr/src/arvados/doc
bundle exec rake generate baseurl=http://$localip:${services[doc]} arvados_api_host=$localip:${services[controller-ssl]} arvados_workbench_host=http://$localip
-
-sv stop doc >/dev/null
+++ /dev/null
-/usr/local/lib/arvbox/runsu.sh
\ No newline at end of file
--- /dev/null
+#!/bin/sh
+# Copyright (C) The Arvados Authors. All rights reserved.
+#
+# SPDX-License-Identifier: AGPL-3.0
+
+set -e
+
+/usr/local/lib/arvbox/runsu.sh $0-service
+sv stop ready
echo "Workbench is running at http://$localip"
rm -r /tmp/arvbox-ready
-
-sv stop ready >/dev/null