Brett Smith [Fri, 12 Feb 2016 23:38:36 +0000 (18:38 -0500)]
8203: crunch-job tempfails after failing to install a Docker image.
* Use `bash -o pipefail` to run the image installer shell script, to
reliably detect more failures.
* Exit EX_RETRY_UNLOCKED after a failure to install the image.
Refs #8203, because I'm cautiously optimistic that this will reduce
the incidence of the "can't find UID" problem.
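To illustrate why `-o pipefail` detects more failures, here is a minimal Python sketch. The one-line script `false | cat` is a hypothetical stand-in for the real image installer script, and `run_script` is a helper invented for this example. Without pipefail, bash reports only the exit status of the last command in a pipeline, so a failure in an earlier stage is hidden.

```python
import subprocess

# Hypothetical stand-in for an installer script: the first stage fails,
# but the last stage of the pipeline (cat) succeeds.
SCRIPT = "false | cat"


def run_script(script, pipefail):
    """Run a script under bash and return its exit status."""
    cmd = ["bash"]
    if pipefail:
        cmd += ["-o", "pipefail"]
    cmd += ["-c", script]
    return subprocess.run(cmd).returncode


# Without pipefail, the pipeline's status is that of its last command.
without_pipefail = run_script(SCRIPT, pipefail=False)
# With pipefail, a failing stage makes the whole pipeline fail.
with_pipefail = run_script(SCRIPT, pipefail=True)
print(without_pipefail, with_pipefail)
```

A caller that checks only the exit status would treat the first run as a success and the second as a failure.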
Brett Smith [Wed, 10 Feb 2016 18:19:11 +0000 (13:19 -0500)]
Pin PySDK's gflags dependency to <3.0.
We've built and tested with 3.0 successfully, but its ChangeLog says:
* A lot of potentially backwards incompatible changes since 2.0.
* This version is NOT recommended to use in production. Some of the files and
documentation has been lost during export; this will be fixed in next
versions.
We found out about this after 3.0.2 broke our tests.
Take their advice for now. No issue #.
Brett Smith [Tue, 9 Feb 2016 22:10:08 +0000 (17:10 -0500)]
crunch-job detects more "io aborted" SLURM errors.
It's seemingly random whether SLURM reports "Aborting, io aborted and
missing step" or "Aborting, missing step and io aborted". Extend the
regexp to catch both. No issue #.
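A sketch of a pattern that accepts both orderings, written in Python for illustration. The pattern and the `is_io_aborted` helper are assumptions. crunch-job itself is not written in Python and its actual regexp may differ.

```python
import re

# Assumed pattern: match either ordering of the two SLURM phrases.
# The regexp actually used by crunch-job may be written differently.
IO_ABORTED = re.compile(
    r"Aborting, (?:io aborted and missing step|missing step and io aborted)"
)


def is_io_aborted(line):
    """Return True if the line contains either form of the SLURM error."""
    return IO_ABORTED.search(line) is not None
```

Spelling out both alternatives keeps the pattern strict. A looser pattern that accepts any text between the two phrases would also match unrelated messages.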
Peter Amstutz [Fri, 5 Feb 2016 16:10:43 +0000 (11:10 -0500)]
7667: Combine polling logs into fewer lines for less noise. Adjust message
when last_ping_at is unexpectedly None to be less severe (this can happen in
innocent circumstances). Report nodes in "booted" list as "booting" since they
are unpaired. Fix tests.
Brett Smith [Fri, 5 Feb 2016 09:52:43 +0000 (04:52 -0500)]
7868: Update API server's arvados-cli version.
Curoverse clusters are deployed by setting CRUNCH_JOB_BIN,
effectively excluding it from the bundle, but this is not true for
clusters deployed following the install guide. Out of the box,
they'll use the version of crunch-job that's actually in the
arvados-cli gem in the bundle.
crunch-dispatch has functionality in it that requires a newer
arvados-cli, so update accordingly. This is not exactly the version
produced by #7868, but it's pretty close.
I think there's a strong case that we should update this version
whenever we make a substantial change to crunch-job. But since I'm
pushing this without discussion or review, I'm doing the smallest
thing possible.
Tom Clegg [Mon, 1 Feb 2016 06:58:34 +0000 (01:58 -0500)]
8288: Add timeout option to close() method of event clients.
Previously in EventClient, close() didn't wait for anything. Now, if a
timeout is given, it waits for ws4py to call the closed() callback to
indicate the connection has closed.
Previously in PollClient, close() waited indefinitely for the polling
thread to terminate. This can take a very long time if, for example,
there are multiple subscriptions and the "get logs" API transaction is
slow.
The only apparent reason a caller would want to wait here at all is to
guarantee the simplifying assumption that the on_event() callback is never
called after close(). Now, instead of letting the thread run until
all events are received and handled, PollClient achieves this the same
way EventClient does: ignore events that arrive after close().
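The behaviour described above can be sketched as follows. This is a simplified, hypothetical polling client, not the Arvados PollClient. The class name, the `poll` callable, and the `_deliver` helper are assumptions made for this example. A lock ensures that once close() has returned, on_event() is not called again, and close() joins the polling thread only when a timeout is given.

```python
import threading


class PollClientSketch:
    """Hypothetical polling client; not the Arvados PollClient itself."""

    def __init__(self, poll, on_event, interval=0.01):
        self._poll = poll            # callable returning a list of events
        self._on_event = on_event    # user callback
        self._interval = interval
        self._lock = threading.RLock()
        self._closed = False
        self._wake = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _deliver(self, event):
        """Pass an event to the callback unless close() was already called."""
        with self._lock:
            if self._closed:
                return False         # event arrived after close(): ignore it
            self._on_event(event)
            return True

    def _run(self):
        while not self._wake.is_set():
            for event in self._poll():
                if not self._deliver(event):
                    return
            self._wake.wait(self._interval)

    def close(self, timeout=None):
        """Stop delivering events; wait up to `timeout` seconds if given."""
        with self._lock:
            self._closed = True
        self._wake.set()
        if (timeout is not None and self._thread.is_alive()
                and threading.current_thread() is not self._thread):
            self._thread.join(timeout)
```

Because late events are dropped rather than drained, close() does not need to wait for a slow poll to finish. The reentrant lock lets the callback itself call close() without deadlocking.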
Brett Smith [Thu, 4 Feb 2016 10:33:24 +0000 (05:33 -0500)]
Make install guide slurm.conf more Arvados-compliant.
* SelectType=select/linear allocates entire nodes at a time. The
previous value scheduled individual cores.
* With that change, SelectTypeParameters=CR_CPU_Memory is not valid.
Remove it, as we do in production.
* The setting of FastSchedule seems less pressing, but 0 is what we
use in production, so share that here too.
Changing from `/etc/ssl/certs` to `/etc/ssl/certs/ca-certificates.crt`
is safe, because add_trust_ca accepts either a directory with hashed
certs, or a file with multiple certs. On Debian, the latter path is a
single file built from the hashed certs in the former, so this is
functionally identical there, and more predictable on Red Hat (where I
don't know what it's doing).
Peter Amstutz [Tue, 2 Feb 2016 17:26:57 +0000 (12:26 -0500)]
6702: Refactor create_node into BaseComputeNodeDriver so the logic also
applies to Azure. Add a new find_node() method; if it returns None or raises
an error, re-raise the original create_node exception.
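A minimal sketch of the fallback described above. The class below is hypothetical and is not the real BaseComputeNodeDriver. The `_create` hook and the `name` argument are assumptions, and returning the node that find_node() locates is my reading of the implied success path.

```python
class BaseDriverSketch:
    """Hypothetical base driver; only create_node and find_node are
    names taken from the commit message."""

    def _create(self, name):
        # Assumed hook for the cloud-specific creation call.
        raise NotImplementedError

    def find_node(self, name):
        # Cloud-specific lookup; returns a node or None.
        raise NotImplementedError

    def create_node(self, name):
        try:
            return self._create(name)
        except Exception:
            # The create call failed. Check whether the node exists anyway.
            try:
                node = self.find_node(name)
            except Exception:
                node = None
            if node is None:
                # Nothing found: re-raise the original creation error.
                raise
            return node
```

Putting this logic in the base class means each cloud-specific subclass only has to supply its own creation and lookup calls.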
Brett Smith [Mon, 1 Feb 2016 17:43:04 +0000 (12:43 -0500)]
8014: Remove more upgrade script references from install guide.
The steps removed are now handled by Rails package postinst scripts.
This should've been done in 378a988bbf9e29736382339f587582259b641782,
but was overlooked. Refs #8014.
Brett Smith [Mon, 1 Feb 2016 16:51:14 +0000 (11:51 -0500)]
Fix install doc rendering of API Nginx config.
<notextile> doesn't actually nest like proper HTML; it's just a
boolean that remembers the last state. Turn it back on after doing an
include that turns it off. No issue #.
Brett Smith [Thu, 28 Jan 2016 00:02:05 +0000 (19:02 -0500)]
8005: Install guide uses runit packages on Red Hat.
The runit RPMs only provide /etc/service. The .debs provide /etc/sv
and /etc/service. Our understanding is that /etc/sv is for all
service definitions (akin to /etc/init.d), and /etc/service is for
service definitions that runit should start at boot (akin to
/etc/rcN.d). To provide uniformity, our install guide instructs users
to make /etc/sv if needed, and link it to /etc/service.
This commit could go farther. Today it would be best if all the runit
sections in the install guide followed Tom's modern template used for
arv-git-httpd and arvados-docker-cleaner. However, that will probably
require some creation and testing of log/run scripts, and some
adaptation of the run scripts to fit the template. I wish I could
include those improvements now, but unfortunately I'm out of time, so
they'll have to wait for another day.
radhika [Thu, 21 Jan 2016 20:25:06 +0000 (15:25 -0500)]
8178: All three currently supported volumes should return an error when the trash-lifetime period is not configured. The Azure blob and S3 volumes are updated to do so.
Returning an error causes test failures in the unix volume, so that part is still a work in progress.