Peter Amstutz [Fri, 5 Feb 2016 16:10:43 +0000 (11:10 -0500)]
7667: Combine polling logs into fewer lines for less noise. Adjust message
when last_ping_at is unexpectedly None to be less severe (can happen in
innocent circumstances). Report nodes in "booted" list as "booting" since they
are unpaired. Fix tests.
Brett Smith [Fri, 5 Feb 2016 09:52:43 +0000 (04:52 -0500)]
7868: Update API server's arvados-cli version.
Curoverse clusters are deployed by setting CRUNCH_JOB_BIN,
effectively excluding it from the bundle, but this is not true for
clusters deployed following the install guide. Out of the box,
they'll use the version of crunch-job that's actually in the
arvados-cli gem in the bundle.
crunch-dispatch has functionality in it that requires a newer
arvados-cli, so update accordingly. This is not exactly the version
produced by #7868, but it's pretty close.
I think there's a strong case that we should update this version
whenever we make a substantial change to crunch-job. But since I'm
pushing this without discussion or review, I'm doing the smallest
thing possible.
Tom Clegg [Mon, 1 Feb 2016 06:58:34 +0000 (01:58 -0500)]
8288: Add timeout option to close() method of event clients.
Previously in EventClient, close() didn't wait for anything. Now, if a
timeout is given, it waits for ws4py to call the closed() callback to
indicate the connection has closed.
Previously in PollClient, close() waited indefinitely for the polling
thread to terminate. This can take a very long time if, for example,
there are multiple subscriptions and the "get logs" API transaction is
slow.
The only apparent reason a caller would want to wait here at all is to
guarantee the simplifying assumption that the on_event() callback is never
called after close(). Now, instead of letting the thread run until
all events are received and handled, PollClient achieves this the same
way EventClient does: ignore events that arrive after close().
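A minimal sketch of the close() behavior described above, assuming a
hypothetical SketchPollClient class (not the SDK's PollClient) and
hypothetical fetch_events/on_event callables. close() marks the client
closed so that later events are dropped, and only waits for the polling
thread when a timeout is given:

```python
import threading


class SketchPollClient(threading.Thread):
    """Hypothetical polling client; illustrates the close() semantics only."""

    def __init__(self, fetch_events, on_event, poll_interval=0.01):
        super().__init__(daemon=True)
        self._fetch_events = fetch_events    # callable returning a list of events
        self._on_event = on_event            # callback invoked once per event
        self._poll_interval = poll_interval
        self._closing = threading.Event()
        # Reentrant so that on_event() may itself call close().
        self._close_lock = threading.RLock()

    def run(self):
        while not self._closing.is_set():
            for event in self._fetch_events():
                with self._close_lock:
                    # Ignore events that arrive after close().
                    if self._closing.is_set():
                        return
                    self._on_event(event)
            self._closing.wait(self._poll_interval)

    def close(self, timeout=None):
        # Taking the lock means an in-progress callback finishes first;
        # after that, no further callbacks are delivered.
        with self._close_lock:
            self._closing.set()
        # Only wait for the polling thread if the caller asked for it.
        if timeout is not None and threading.current_thread() is not self:
            self.join(timeout)
```

With this shape, close() returns promptly even if a slow fetch is still in
flight, because events from that fetch are discarded rather than delivered.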
Brett Smith [Thu, 4 Feb 2016 10:33:24 +0000 (05:33 -0500)]
Make install guide slurm.conf more Arvados-compliant.
* SelectType=select/linear allocates entire nodes at a time. The
previous value scheduled individual cores.
* With that change, SelectTypeParameters=CR_CPU_Memory is not valid.
Remove it, as we do in production.
* The setting of FastSchedule seems less pressing, but 0 is what we
use in production, so share that here too.
Changing from `/etc/ssl/certs` to `/etc/ssl/certs/ca-certificates.crt`
is safe, because add_trust_ca accepts either a directory with hashed
certs, or a file with multiple certs. On Debian, the latter path is a
single file built from the hashed certs in the former, so this is
functionally identical there, and more predictable on Red Hat (where I
don't know what it's doing).
Peter Amstutz [Tue, 2 Feb 2016 17:26:57 +0000 (12:26 -0500)]
6702: Refactor create_node to BaseComputeNodeDriver so logic also applies to
Azure. Adds new find_node() method; if it returns None or raises an error,
re-raise the original create_node exception.
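A minimal sketch of that fallback, assuming a hypothetical SketchDriver
class and a hypothetical cloud object with create_node() and list_nodes()
methods (none of these names come from the node manager itself). It also
assumes the lookup matches nodes by name, which the commit message does not
state:

```python
class SketchDriver:
    """Hypothetical driver; shows only the create/find fallback logic."""

    def __init__(self, cloud):
        self.cloud = cloud  # assumed to offer create_node() and list_nodes()

    def find_node(self, name):
        # Return the node with the given name, or None if there is none.
        for node in self.cloud.list_nodes():
            if node.name == name:
                return node
        return None

    def create_node(self, name):
        try:
            return self.cloud.create_node(name)
        except Exception as create_error:
            # The request may have succeeded even though the call failed,
            # so check whether the node exists before giving up.
            try:
                node = self.find_node(name)
            except Exception:
                node = None
            if node is None:
                # Lookup found nothing or failed: report the original error.
                raise create_error
            return node
```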
Brett Smith [Mon, 1 Feb 2016 17:43:04 +0000 (12:43 -0500)]
8014: Remove more upgrade script references from install guide.
The steps removed are now handled by Rails package postinst scripts.
This should've been done in 378a988bbf9e29736382339f587582259b641782,
but was overlooked. Refs #8014.
Brett Smith [Mon, 1 Feb 2016 16:51:14 +0000 (11:51 -0500)]
Fix install doc rendering of API Nginx config.
<notextile> doesn't actually nest like proper HTML, it's just a
boolean that remembers the last state. Turn it back on after doing an
include that turns it off. No issue #.
Brett Smith [Thu, 28 Jan 2016 00:02:05 +0000 (19:02 -0500)]
8005: Install guide uses runit packages on Red Hat.
The runit RPMs only provide /etc/service. The .debs provide /etc/sv
and /etc/service. Our understanding is that /etc/sv is for all
service definitions (akin to /etc/init.d), and /etc/service is for
service definitions that runit should start at boot (akin to
/etc/rcN.d). To provide uniformity, our install guide instructs users
to make /etc/sv if needed, and link it to /etc/service.
This commit could go farther. Today it would be best if all the runit
sections in the install guide followed Tom's modern template used for
arv-git-httpd and arvados-docker-cleaner. However, that will probably
require some creation and testing of log/run scripts, and some
adaptation of the run scripts to fit the template. I wish I could
include those improvements now, but unfortunately I'm out of time, so
they'll have to wait for another day.
radhika [Thu, 21 Jan 2016 20:25:06 +0000 (15:25 -0500)]
8178: All three currently supported volumes return an error when the trash-lifetime period is not configured. The Azure blob and S3 volumes are updated to do so.
Returning an error causes test failures in the unix volume, so that part is still a work in progress.
Tom Clegg [Thu, 21 Jan 2016 19:35:34 +0000 (14:35 -0500)]
8281: Fix KeepClient retry bugs.
get() and put() were both handling all Curl exceptions -- including
timeouts -- by marking the keep service as unusable. For example, if a
single proxy is the only service available, a single timeout was
fatal. This is fixed by setting the retry loop status to None instead
of False after curl exceptions.
put() was repeating its retry loop until it achieved the desired
number of replicas _in a single iteration_. For example, when trying
to store 2 replicas, 6 loop iterations with a single success in each
iteration would result in 6 copies being stored but put() declaring
failure. This is fixed by checking against a cumulative "done" counter
instead of the "copies done in this loop iteration" counter.
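A minimal sketch of the two fixes, not the actual KeepClient code. The
put_with_retries() function and its store_round callable are hypothetical:
store_round(n) stands for one pass over the available services, returning
how many copies it stored or raising TimeoutError. A timeout counts as
retryable rather than fatal, and success is judged against a cumulative
counter:

```python
def put_with_retries(store_round, want_copies, num_retries):
    """Retry until want_copies replicas have been stored in total."""
    done = 0  # cumulative across iterations, not reset per iteration
    for _ in range(num_retries + 1):
        try:
            # Ask only for the copies still missing.
            done += store_round(want_copies - done)
        except TimeoutError:
            # Transient failure: keep the service usable and try again.
            continue
        if done >= want_copies:
            return done
    raise IOError("stored %d of %d copies" % (done, want_copies))
```

With one success per iteration and want_copies=2, this returns after the
second iteration instead of looping until retries run out.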
Tom Clegg [Thu, 21 Jan 2016 09:01:16 +0000 (04:01 -0500)]
8281: Fix arv-mount ignoring --retries argument when writing file data.
"num_retries" arguments get passed around extensively in arvfile.py
and collection.py in the Python SDK, but ultimately the writing of
file data is done by a _BlockManager which doesn't have any way to
accept that argument or pass it along to a KeepClient, so PUT requests
always use the CollectionWriter's KeepClient's default num_retries.
In arv-mount's case, we have been telling CollectionWriter the
num_retries we want. When CollectionWriter creates a KeepClient,
num_retries gets passed along -- normally this works around the fact
that num_retries gets lost by the _BlockManager layer. However, we
provided our own KeepClient to use instead of letting CollectionWriter
create one, and we forgot to set num_retries on our own KeepClient, so
we weren't retrying PUT requests.