Tom Clegg [Wed, 10 Feb 2016 17:10:53 +0000 (12:10 -0500)]
8341: Fall back to live logs if log collection is saved but missing.
Tom Clegg [Wed, 10 Feb 2016 16:14:54 +0000 (11:14 -0500)]
8341: Update test results.
Tom Clegg [Wed, 10 Feb 2016 03:44:02 +0000 (22:44 -0500)]
8341: Retrieve only the log attributes that actually get used.
Tom Clegg [Tue, 9 Feb 2016 18:53:38 +0000 (13:53 -0500)]
8341: In pipeline mode, process all jobs concurrently.
Tom Clegg [Tue, 9 Feb 2016 16:49:35 +0000 (11:49 -0500)]
8341: Include Keep network activity in net stats.
Tom Clegg [Tue, 9 Feb 2016 15:27:47 +0000 (10:27 -0500)]
8341: Fix up debug labels. Avoid deadlock after exceptions in thread.
Tom Clegg [Mon, 8 Feb 2016 20:47:02 +0000 (15:47 -0500)]
8341: Do not round up Y axis to even numbers, just use max series value.
Remove Y axis labels (so X axis matches other graphs from the same
job), add grid lines.
Tom Clegg [Mon, 8 Feb 2016 14:56:10 +0000 (09:56 -0500)]
8341: Use "time since job start", not "time since task start", as X axis.
Tom Clegg [Wed, 10 Feb 2016 15:51:33 +0000 (10:51 -0500)]
Merge branch '8284-fix-slurm-queue-timestamp-check' closes #8284
Tom Clegg [Wed, 10 Feb 2016 15:22:07 +0000 (10:22 -0500)]
Emit log when installing docker image.
Avoids creating the illusion that "clean work dirs" is taking forever.
No issue #
Tom Clegg [Wed, 10 Feb 2016 15:08:57 +0000 (10:08 -0500)]
Merge branch '7263-better-busy-behavior' refs #7263
Tom Clegg [Fri, 22 Jan 2016 20:02:21 +0000 (15:02 -0500)]
7263: Avoid getting stuck processing stderr for one task for a long time.
Do not sleep(0.1) unless pipes are idle.
Brett Smith [Tue, 9 Feb 2016 22:10:08 +0000 (17:10 -0500)]
crunch-job detects more "io aborted" SLURM errors.
It's seemingly random whether SLURM reports "Aborting, io aborted and
missing step" or "Aborting, missing step and io aborted". Extend the
regexp to catch both. No issue #.
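The two message orderings quoted above can be covered by one alternation. What follows is only a minimal Python sketch of that idea: the actual crunch-job regexp is not shown in this log, and the names IO_ABORT_RE and is_io_abort are hypothetical.

```python
import re

# Hypothetical pattern, not the real crunch-job regexp.
# It accepts either ordering of the two phrases SLURM may report.
IO_ABORT_RE = re.compile(
    r"Aborting, (?:io aborted and missing step|missing step and io aborted)"
)


def is_io_abort(line):
    """Return True if a stderr line carries either form of the message."""
    return IO_ABORT_RE.search(line) is not None
```

Using search() rather than match() lets the pattern find the phrase anywhere in a log line, whatever prefix precedes it.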
Brett Smith [Tue, 9 Feb 2016 21:42:59 +0000 (16:42 -0500)]
Merge branch '8406-tempfail-after-retry-unlocked'
Closes #8406, #8407.
Brett Smith [Tue, 9 Feb 2016 21:42:12 +0000 (16:42 -0500)]
8406: Update comment to match new code.
Peter Amstutz [Tue, 9 Feb 2016 21:25:45 +0000 (16:25 -0500)]
8406: @job_retry_counts.include? jobrecord.uuid because @job_retry_counts has a default value.
Peter Amstutz [Tue, 9 Feb 2016 20:53:13 +0000 (15:53 -0500)]
8406: Treat EXIT_TEMPFAIL as EXIT_RETRY_UNLOCKED if we have previously gotten
EXIT_RETRY_UNLOCKED (because the job is now in "Running" state.)
Nico Cesar [Tue, 9 Feb 2016 21:34:34 +0000 (16:34 -0500)]
I "fonud" a typo
no issue #
Peter Amstutz [Tue, 9 Feb 2016 17:32:42 +0000 (12:32 -0500)]
Merge branch '8404-catch-interrupted-syscall' closes #8404
Peter Amstutz [Tue, 9 Feb 2016 17:31:07 +0000 (12:31 -0500)]
8404: Adjust try block to just surround os.wait().
Peter Amstutz [Tue, 9 Feb 2016 16:41:13 +0000 (11:41 -0500)]
8404: catch and continue from interrupted system call from os.wait()
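A minimal sketch of the "catch and continue" pattern, assuming the interruption surfaces as an OSError whose errno is EINTR. The function name wait_retrying_on_eintr and its injectable wait parameter are hypothetical and are not taken from the Arvados source.

```python
import errno
import os


def wait_retrying_on_eintr(wait=os.wait):
    """Call wait() again whenever it is interrupted by a signal."""
    while True:
        try:
            # Keep the try block as narrow as possible: only the wait call.
            return wait()
        except OSError as e:
            if e.errno != errno.EINTR:
                # Any other failure (e.g. ECHILD) is a real error.
                raise
```

On Python 3.5 and later the interpreter retries interrupted system calls itself (PEP 475), so an explicit loop like this mainly matters on older interpreters.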
Tom Clegg [Mon, 8 Feb 2016 21:09:28 +0000 (16:09 -0500)]
Fix nodemanager test race. No issue #
Tom Clegg [Mon, 8 Feb 2016 19:32:45 +0000 (14:32 -0500)]
Merge branch '8341-live-crunchstat-summary' refs #8341
Tom Clegg [Mon, 8 Feb 2016 19:18:42 +0000 (14:18 -0500)]
8341: Use a Queue of lines and one thread, instead of a succession of threads and a deque of buffers.
Tom Clegg [Mon, 8 Feb 2016 01:19:45 +0000 (20:19 -0500)]
8341: Move reader classes to reader.py.
Tom Clegg [Mon, 8 Feb 2016 01:15:00 +0000 (20:15 -0500)]
8341: Use a worker thread to get page N+1 of logs while parsing page N.
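The prefetching idea can be sketched as follows. This is an illustration in Python 3, not the crunchstat-summary code: fetch_page is a hypothetical callable that returns the list of log lines for page n, or an empty list when there are no more pages.

```python
import queue
import threading


def prefetched(fetch_page, maxsize=1):
    """Yield pages in order while a worker thread fetches the next ones."""
    q = queue.Queue(maxsize=maxsize)
    done = object()  # sentinel marking the end of the stream

    def worker():
        n = 0
        try:
            while True:
                page = fetch_page(n)
                if not page:
                    break
                q.put(page)  # blocks when the consumer falls behind
                n += 1
        except Exception as exc:
            q.put(exc)  # hand the error to the consumer
        finally:
            q.put(done)

    t = threading.Thread(target=worker)
    t.daemon = True  # do not keep the process alive if the consumer stops early
    t.start()

    while True:
        item = q.get()
        if item is done:
            return
        if isinstance(item, Exception):
            raise item
        yield item
```

The bounded queue keeps the worker only a page or two ahead of the parser, and forwarding exceptions through the queue means a failed fetch ends the iteration instead of leaving the consumer waiting forever.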
Tom Clegg [Mon, 8 Feb 2016 00:43:02 +0000 (19:43 -0500)]
8341: Get job log from logs API if the log has not been written to Keep yet.
Tom Clegg [Mon, 8 Feb 2016 19:29:03 +0000 (14:29 -0500)]
Merge branch '8289-no-extra-orders' closes #8289
Tom Clegg [Mon, 8 Feb 2016 19:28:02 +0000 (14:28 -0500)]
8289: Strip redundant orders, even when provided explicitly by client.
Tom Clegg [Sat, 23 Jan 2016 05:23:49 +0000 (00:23 -0500)]
8289: Do not add fallback orders if client already specified an unambiguous order.
Peter Amstutz [Mon, 8 Feb 2016 16:28:53 +0000 (11:28 -0500)]
Merge branch '7667-node-manager-logging' refs #7667
Peter Amstutz [Mon, 8 Feb 2016 16:28:11 +0000 (11:28 -0500)]
7667: Store node size in a table, to avoid blocking on booting and shutdown
actors when asking for node size.
Peter Amstutz [Mon, 8 Feb 2016 03:52:51 +0000 (22:52 -0500)]
7667: Fix log message
Tom Clegg [Sat, 6 Feb 2016 00:45:30 +0000 (19:45 -0500)]
Merge branch '8285-fuse-subscribe-websockets' closes #8285
Tom Clegg [Sat, 6 Feb 2016 00:39:42 +0000 (19:39 -0500)]
8285: Test that arvados.events.subscribe() is called only when needed.
Add missing TagsDirectory.want_event_subscribe().
Peter Amstutz [Sat, 6 Feb 2016 00:17:42 +0000 (19:17 -0500)]
8285: Add test for listen_for_events
Peter Amstutz [Fri, 5 Feb 2016 21:39:25 +0000 (16:39 -0500)]
8285: Add want_event_subscribe flag to subclasses of fusedir.Directory,
determine whether to call listen_for_events based on it.
Peter Amstutz [Fri, 5 Feb 2016 16:10:43 +0000 (11:10 -0500)]
7667: Combine polling logs into fewer lines for less noise. Adjust message
when last_ping_at is unexpectedly none to be less severe (can happen in
innocent circumstances). Report nodes in "booted" list as "booting" since they
are unpaired. Fix tests.
Brett Smith [Fri, 5 Feb 2016 09:52:43 +0000 (04:52 -0500)]
7868: Update API server's arvados-cli version.
Curoverse clusters are deployed by setting CRUNCH_JOB_BIN,
effectively excluding it from the bundle, but this is not true for
clusters deployed following the install guide. Out of the box,
they'll use the version of crunch-job that's actually in the
arvados-cli gem in the bundle.
crunch-dispatch has functionality in it that requires a newer
arvados-cli, so update accordingly. This is not exactly the version
produced by #7868, but it's pretty close.
I think there's a strong case that we should update this version
whenever we make a substantial change to crunch-job. But since I'm
pushing this without discussion or review, I'm doing the smallest
thing possible.
Refs #7868.
Peter Amstutz [Thu, 4 Feb 2016 23:46:31 +0000 (18:46 -0500)]
7667: Node manager bug fixes and logging improvements.
* ComputeNodeSetupActor will now finish if there is an unhandled exception.
* ComputeNodeMonitorActor now explains why a node that is in the shutdown window
is not eligible for shutdown.
* Logging in nodes_wanted now distinguishes idle/busy/booting/shutting down.
* Logging by actors is now class name and a portion of the actor urn, so actions
of a specific actor can be consistently identified.
Tom Clegg [Thu, 4 Feb 2016 19:29:39 +0000 (14:29 -0500)]
Recognize another way slurm tells us about node failures.
Retry, instead of giving up, in situations like this:
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Aborting, io error and missing step on node 0
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Timed out waiting for job step to complete
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 child 42984 on compute26.1 exit 0 success=
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 failure (#1, permanent) after 560 seconds
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 task output (0 bytes):
No issue #
Tom Clegg [Thu, 4 Feb 2016 18:17:31 +0000 (13:17 -0500)]
Merge branch '8288-poll-client-close-timeout' refs #8288
Tom Clegg [Mon, 1 Feb 2016 06:58:34 +0000 (01:58 -0500)]
8288: Add timeout option to close() method of event clients.
Previously in EventClient, close() didn't wait for anything. Now, if a
timeout is given, it waits for ws4py to call the closed() callback to
indicate the connection has closed.
Previously in PollClient, close() waited indefinitely for the polling
thread to terminate. This can take a very long time if, for example,
there are multiple subscriptions and the "get logs" API transaction is
slow.
The only apparent reason a caller would want to wait here at all is to
guarantee the simplifying assumption that the on_event() callback is never
called after close(). Now, instead of letting the thread run until
all events are received and handled, PollClient achieves this the same
way EventClient does: ignore events that arrive after close().
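The "ignore events that arrive after close()" approach might look roughly like this. It is a sketch under stated assumptions, not the arvados.events implementation: the class SketchPollClient, its deliver() method, and the use of a reentrant lock are all hypothetical.

```python
import threading


class SketchPollClient(object):
    """Hypothetical client that drops events once close() has been called."""

    def __init__(self, on_event):
        self._on_event = on_event
        self._closed = False
        # Reentrant, so on_event() may itself call close() without deadlock.
        self._lock = threading.RLock()

    def deliver(self, event):
        """Called by the polling thread for each event it receives."""
        with self._lock:
            if self._closed:
                return False  # arrived after close(): ignore it
            self._on_event(event)
            return True

    def close(self):
        """Return promptly; the polling thread is not joined."""
        with self._lock:
            self._closed = True
```

Because close() only flips a flag, it no longer waits on a slow "get logs" transaction, while the lock still ensures on_event() is not invoked once close() has returned.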
Brett Smith [Thu, 4 Feb 2016 10:33:24 +0000 (05:33 -0500)]
Make install guide slurm.conf more Arvados-compliant.
* SelectType=select/linear allocates entire nodes at a time. The
previous value scheduled individual cores.
* With that change, SelectTypeParameters=CR_CPU_Memory is not valid.
Remove it, as we do in production.
* The setting of FastSchedule seems less pressing, but 0 is what we
use in production, so share that here too.
No issue #.
Peter Amstutz [Wed, 3 Feb 2016 22:51:46 +0000 (17:51 -0500)]
Try to make logging identify the actor consistently
Peter Amstutz [Wed, 3 Feb 2016 20:54:18 +0000 (15:54 -0500)]
Merge branch '6702-gce-node-create-fix' closes #6702
Tom Clegg [Wed, 3 Feb 2016 17:51:54 +0000 (12:51 -0500)]
Merge branch '8288-arv-mount-deadlock' refs #8288
Tom Clegg [Tue, 2 Feb 2016 21:46:35 +0000 (16:46 -0500)]
8288: Do not call operations.destroy() as a last resort, just abandon the llfuse thread.
Tom Clegg [Mon, 1 Feb 2016 08:01:31 +0000 (03:01 -0500)]
8288: Add test case for --exec mode.
Tom Clegg [Mon, 1 Feb 2016 02:43:30 +0000 (21:43 -0500)]
8288: Give fusermount -u a chance to work before resorting to operations.destroy().
Log a warning when resorting to operations.destroy().
De-duplicate setup/teardown code so more of the --exec code path is exercised in tests.
Tom Clegg [Wed, 3 Feb 2016 17:50:31 +0000 (12:50 -0500)]
8123: Install chartjs.js asset file.
...during "setup.py install" too, not just when installing via
package.
refs #8123
Brett Smith [Wed, 3 Feb 2016 11:42:17 +0000 (06:42 -0500)]
Improve install guide Nginx+SCL integration.
No issue #.
Brett Smith [Wed, 3 Feb 2016 11:26:32 +0000 (06:26 -0500)]
login-sync gets user's home from /etc/passwd.
No issue #.
Brett Smith [Wed, 3 Feb 2016 10:37:42 +0000 (05:37 -0500)]
Workbench loads CA certs on Red Hat.
This has the same rationale and logic as #6432 and
9b910084faf3db6fa2071af604620e7d45d12a6c, applied to Workbench.
Changing from `/etc/ssl/certs` to `/etc/ssl/certs/ca-certificates.crt`
is safe, because add_trust_ca accepts either a directory with hashed
certs, or a file with multiple certs. On Debian, the latter path is a
single file built from the hashed certs in the former, so this is
functionally identical there, and more predictable on Red Hat (where I
don't know what it's doing).
No issue #.
Brett Smith [Wed, 3 Feb 2016 09:53:04 +0000 (04:53 -0500)]
Add fuse dependency to FUSE driver package.
When the fuse tools aren't installed, attempting to run arv-mount
fails with "fuse: failed to exec fusermount".
No issue #.
Brett Smith [Wed, 3 Feb 2016 09:39:27 +0000 (04:39 -0500)]
Add curl library dependency to shell install guide.
No issue #.
Brett Smith [Wed, 3 Feb 2016 09:32:39 +0000 (04:32 -0500)]
SLURM install guide notes slurm.conf path on Red Hat.
No issue #.
Brett Smith [Wed, 3 Feb 2016 09:26:49 +0000 (04:26 -0500)]
Add missing ; in keepproxy Nginx config.
No issue #.
Peter Amstutz [Tue, 2 Feb 2016 17:26:57 +0000 (12:26 -0500)]
6702: Refactor create_node to BaseComputeNodeDriver so logic also applies to
Azure. Adds new find_node() method; if it returns None or raises an error,
re-raise the original create_node exception.
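The fallback described above can be sketched as a small wrapper. This is an illustration only: create_node_checked and its callable arguments are hypothetical and do not reproduce the BaseComputeNodeDriver signatures.

```python
def create_node_checked(create_node, find_node, name):
    """Create a node; if creation errors out, check whether it exists anyway."""
    try:
        return create_node(name)
    except Exception as create_error:
        try:
            node = find_node(name)
        except Exception:
            # A failed lookup is treated the same as "not found".
            node = None
        if node is None:
            # Nothing was created: surface the original create_node error.
            raise create_error
        return node
```

Re-raising the original exception, rather than the lookup error, keeps the reported failure tied to the operation the caller actually asked for.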
Peter Amstutz [Tue, 2 Feb 2016 16:31:15 +0000 (11:31 -0500)]
Merge branch '6702-gce-node-create-fix' closes #6702
Peter Amstutz [Tue, 2 Feb 2016 16:05:50 +0000 (11:05 -0500)]
Merge branch 'fix/build-python-llfuse-version' of https://github.com/wtsi-hgi/arvados
no issue #
Peter Amstutz [Tue, 2 Feb 2016 15:56:13 +0000 (10:56 -0500)]
Merge branch 'master' into 6702-gce-node-create-fix
Peter Amstutz [Tue, 2 Feb 2016 15:55:58 +0000 (10:55 -0500)]
Merge branch '8206-gce-retry-init' closes #8206
Peter Amstutz [Tue, 2 Feb 2016 15:55:39 +0000 (10:55 -0500)]
8206: Mock time.sleep() to avoid unnecessary delay in test.
Joshua Randall [Tue, 2 Feb 2016 15:45:46 +0000 (15:45 +0000)]
pins python-llfuse version to 0.41.1 for fpm on all platforms
Peter Amstutz [Tue, 2 Feb 2016 15:03:39 +0000 (10:03 -0500)]
8206: Refactor _retry to RetryMixin. Make retry timing consistent.
Brett Smith [Tue, 2 Feb 2016 12:23:10 +0000 (07:23 -0500)]
8005: Install guide suggests slurm-munge on Red Hat SLURM nodes.
This package includes the SLURM plugins that talk to MUNGE.
Refs #8005.
Peter Amstutz [Mon, 1 Feb 2016 19:54:28 +0000 (14:54 -0500)]
6702: Catch GCE create_node() errors and check if the node was actually
created. Added test.
Brett Smith [Mon, 1 Feb 2016 17:43:04 +0000 (12:43 -0500)]
8014: Remove more upgrade script references from install guide.
The steps removed are now handled by Rails package postinst scripts.
This should've been done in
378a988bbf9e29736382339f587582259b641782,
but was overlooked. Refs #8014.
Brett Smith [Mon, 1 Feb 2016 16:53:29 +0000 (11:53 -0500)]
Refresh Gitolite install guide.
* Tested instructions still work with 3.6.4. So noted.
* Prefer cloning Gitolite over HTTPS, since that's less likely to be
firewalled.
No issue #.
Brett Smith [Mon, 1 Feb 2016 16:51:14 +0000 (11:51 -0500)]
Fix install doc rendering of API Nginx config.
<notextile> doesn't actually nest like proper HTML, it's just a
boolean that remembers the last state. Turn it back on after doing an
include that turns it off. No issue #.
Peter Amstutz [Mon, 1 Feb 2016 14:14:41 +0000 (09:14 -0500)]
Pin llfuse to 0.41.1 because 0.42 came out and broke things. no issue #
Brett Smith [Fri, 29 Jan 2016 00:38:04 +0000 (19:38 -0500)]
Merge branch '8005-centos-3rdparty-installs-wip'
Closes #8005, #8135.
Brett Smith [Fri, 29 Jan 2016 00:27:13 +0000 (19:27 -0500)]
8005: Add tar Ruby build dependency on CentOS 6.
Brett Smith [Thu, 28 Jan 2016 00:02:05 +0000 (19:02 -0500)]
8005: Install guide uses runit packages on Red Hat.
The runit RPMs only provide /etc/service. The .debs provide /etc/sv
and /etc/service. Our understanding is that /etc/sv is for all
service definitions (akin to /etc/init.d), and /etc/service is for
service definitions that runit should start at boot (akin to
/etc/rcN.d). To provide uniformity, our install guide instructs users
to make /etc/sv if needed, and link it to /etc/service.
This commit could go farther. Today it would be best if all the runit
sections in the install guide followed Tom's modern template used for
arv-git-httpd and arvados-docker-cleaner. However, that will probably
require some creation and testing of log/run scripts, and some
adaptation of the run scripts to fit the template. I wish I could
include those improvements now, but unfortunately I'm out of time, so
they'll have to wait for another day.
Brett Smith [Thu, 28 Jan 2016 00:08:33 +0000 (19:08 -0500)]
8005: Install guide gets SLURM and MUNGE from RPMs.
Brett Smith [Wed, 27 Jan 2016 23:54:57 +0000 (18:54 -0500)]
8005: Fix bad Textile markup in compute node install guide.
The dashes in command-line switches created strikethrough for much of the notebox.
Brett Smith [Wed, 27 Jan 2016 20:15:23 +0000 (15:15 -0500)]
8005: Document installing Git on CentOS 6 from RepoForge.
Brett Smith [Wed, 27 Jan 2016 20:00:17 +0000 (15:00 -0500)]
8005: DRY up PostgreSQL password auth instructions on CentOS 6.
Ward Vandewege [Thu, 28 Jan 2016 19:32:00 +0000 (14:32 -0500)]
Make our API server packages for debian-based distributions depend on
libcurl-ssl-dev rather than libcurl4-openssl-dev.
No issue #
radhika [Tue, 26 Jan 2016 17:10:04 +0000 (12:10 -0500)]
closes #8198
Merge branch '8198-node-ip-address'
radhika [Tue, 26 Jan 2016 17:09:37 +0000 (12:09 -0500)]
Merge branch 'master' into 8198-node-ip-address
radhika [Tue, 26 Jan 2016 17:08:21 +0000 (12:08 -0500)]
refs #8178
Merge branch '8178-keepstore-trash-interface'
radhika [Tue, 26 Jan 2016 15:41:00 +0000 (10:41 -0500)]
Merge branch '8178-keepstore-trash-interface' of git.curoverse.com:arvados into 8178-keepstore-trash-interface
Conflicts:
services/keepstore/handlers.go
services/keepstore/volume_test.go
radhika [Tue, 26 Jan 2016 15:38:28 +0000 (10:38 -0500)]
8178: untrash should fail when ErrNotImplemented is returned.
radhika [Fri, 22 Jan 2016 22:37:15 +0000 (17:37 -0500)]
8178: (for now) all volumes must return ErrNotImplemented if trash-lifetime != 0
radhika [Thu, 21 Jan 2016 20:25:06 +0000 (15:25 -0500)]
8178: All three currently supported volumes return error when trash-lifetime period is not configured. azure blob and s3 volumes are updated to do so.
Returning an error is causing test failures in unix volume and hence is still a work in progress.
radhika [Thu, 21 Jan 2016 18:59:36 +0000 (13:59 -0500)]
8178: rename Delete api as Trash; add Untrash to volume interface; add UndeleteHandler and test for this endpoint.
Peter Amstutz [Mon, 25 Jan 2016 22:02:40 +0000 (17:02 -0500)]
8206: Add test to support retry on create_driver.
Tom Clegg [Mon, 25 Jan 2016 21:08:14 +0000 (16:08 -0500)]
Merge branch '8123-crunchstat-graphs' closes #8123
Tom Clegg [Mon, 25 Jan 2016 21:05:56 +0000 (16:05 -0500)]
8123: Escape HTML chars in page title.
Peter Amstutz [Mon, 25 Jan 2016 20:36:34 +0000 (15:36 -0500)]
8206: Refactor _retry into common function wrapper usable by both dispatch and
compute drivers.
Tom Clegg [Mon, 25 Jan 2016 06:16:44 +0000 (01:16 -0500)]
8123: Explain existing_constraints and use a proper instance variable.
Tom Clegg [Mon, 25 Jan 2016 06:08:27 +0000 (01:08 -0500)]
8123: Fix accidental old-style class.
Tom Clegg [Mon, 25 Jan 2016 06:00:03 +0000 (01:00 -0500)]
8123: Fix type check to accommodate unicode.
Tom Clegg [Mon, 25 Jan 2016 05:59:46 +0000 (00:59 -0500)]
8123: Use -v,-vv instead of --verbose,--debug.
Tom Clegg [Mon, 25 Jan 2016 02:07:42 +0000 (21:07 -0500)]
8123: Change --include-child-jobs to --skip-child-jobs (default False).
Tom Clegg [Mon, 25 Jan 2016 02:06:48 +0000 (21:06 -0500)]
8123: Explain mysterious memory constraint logic.
Tom Clegg [Mon, 25 Jan 2016 02:05:28 +0000 (21:05 -0500)]
8123: Update test dependencies.
Tom Clegg [Mon, 25 Jan 2016 00:48:06 +0000 (19:48 -0500)]
8284: Fix confusion between %proc and %jobstep.
$proc{$pid}->{jobstep} is an index into @jobstep
$proc{$pid}->{jobstepname} is the name we told srun to use
$proc{$pid}->{killtime} is a deadline when we should kill the process
$jobstep[$jobstepid]->{stderr_at} is the time of last stderr received
We were mistakenly using $proc->{$pid}->{stderr_at}, which was always
undef and therefore always less than $last_squeue_check. This resulted
in jobs being killed as "slurm orphans" when the real reason they
hadn't been returned by waitpid() was that we hadn't finished
consuming their stderr yet.
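The relationship between the two structures can be modelled in a few lines. The original code is Perl; this Python sketch only mirrors the layout described above, and the sample values and the helper last_stderr_at are hypothetical.

```python
# proc is keyed by pid; each entry holds an index into the jobstep list.
proc = {
    42984: {"jobstep": 0, "jobstepname": "example-step", "killtime": None},
}

# jobstep is a list; the stderr timestamp lives here, not in proc.
jobstep = [
    {"stderr_at": 1454402546.0},
]


def last_stderr_at(pid):
    """Follow the index stored in proc to the matching jobstep entry."""
    return jobstep[proc[pid]["jobstep"]]["stderr_at"]


# The mistaken lookup reads a key that proc entries never have.
wrong = proc[42984].get("stderr_at")  # always None
right = last_stderr_at(42984)
```

In the sketch, the wrong lookup always yields None regardless of how recently stderr arrived, which corresponds to the always-undef value the commit message describes.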