arvados.git
8 years ago8341: Fall back to live logs if log collection is saved but missing.
Tom Clegg [Wed, 10 Feb 2016 17:10:53 +0000 (12:10 -0500)]
8341: Fall back to live logs if log collection is saved but missing.

8 years ago8341: Update test results.
Tom Clegg [Wed, 10 Feb 2016 16:14:54 +0000 (11:14 -0500)]
8341: Update test results.

8 years ago8341: Retrieve only the log attributes that actually get used.
Tom Clegg [Wed, 10 Feb 2016 03:44:02 +0000 (22:44 -0500)]
8341: Retrieve only the log attributes that actually get used.

8 years ago8341: In pipeline mode, process all jobs concurrently.
Tom Clegg [Tue, 9 Feb 2016 18:53:38 +0000 (13:53 -0500)]
8341: In pipeline mode, process all jobs concurrently.

8 years ago8341: Include Keep network activity in net stats.
Tom Clegg [Tue, 9 Feb 2016 16:49:35 +0000 (11:49 -0500)]
8341: Include Keep network activity in net stats.

8 years ago8341: Fix up debug labels. Avoid deadlock after exceptions in thread.
Tom Clegg [Tue, 9 Feb 2016 15:27:47 +0000 (10:27 -0500)]
8341: Fix up debug labels. Avoid deadlock after exceptions in thread.

8 years ago8341: Do not round up Y axis to even numbers, just use max series value.
Tom Clegg [Mon, 8 Feb 2016 20:47:02 +0000 (15:47 -0500)]
8341: Do not round up Y axis to even numbers, just use max series value.

Remove Y axis labels (so X axis matches other graphs from the same
job), add grid lines.

8 years ago8341: Use "time since job start", not "time since task start", as X axis.
Tom Clegg [Mon, 8 Feb 2016 14:56:10 +0000 (09:56 -0500)]
8341: Use "time since job start", not "time since task start", as X axis.

8 years agoMerge branch '8284-fix-slurm-queue-timestamp-check' closes #8284
Tom Clegg [Wed, 10 Feb 2016 15:51:33 +0000 (10:51 -0500)]
Merge branch '8284-fix-slurm-queue-timestamp-check' closes #8284

8 years agoEmit log when installing docker image.
Tom Clegg [Wed, 10 Feb 2016 15:22:07 +0000 (10:22 -0500)]
Emit log when installing docker image.

Avoids creating the illusion that "clean work dirs" is taking forever.

No issue #

8 years agoMerge branch '7263-better-busy-behavior' refs #7263
Tom Clegg [Wed, 10 Feb 2016 15:08:57 +0000 (10:08 -0500)]
Merge branch '7263-better-busy-behavior' refs #7263

8 years ago7263: Avoid getting stuck processing stderr for one task for a long time.
Tom Clegg [Fri, 22 Jan 2016 20:02:21 +0000 (15:02 -0500)]
7263: Avoid getting stuck processing stderr for one task for a long time.

Do not sleep(0.1) unless pipes are idle.

8 years agocrunch-job detects more "io aborted" SLURM errors.
Brett Smith [Tue, 9 Feb 2016 22:10:08 +0000 (17:10 -0500)]
crunch-job detects more "io aborted" SLURM errors.

It's seemingly random whether SLURM reports "Aborting, io aborted and
missing step" or "Aborting, missing step and io aborted".  Extend the
regexp to catch both.  No issue #.
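
To illustrate the idea of one pattern covering both orderings, here is a minimal Python sketch. crunch-job itself is written in Perl and its actual regexp is not shown in this log, so the pattern and the helper name `is_io_aborted` are assumptions made for illustration only.

```python
import re

# Hypothetical pattern: matches either ordering of the two phrases that
# the commit message above says SLURM may report.
IO_ABORTED = re.compile(
    r"Aborting, (?:io aborted and missing step|missing step and io aborted)"
)


def is_io_aborted(line):
    """Return True if a stderr line reports the 'io aborted' SLURM error."""
    return IO_ABORTED.search(line) is not None
```

An alternation keeps the two accepted phrasings explicit, so an unrelated "Aborting, ..." message is not misclassified as a retryable node failure.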

8 years agoMerge branch '8406-tempfail-after-retry-unlocked'
Brett Smith [Tue, 9 Feb 2016 21:42:59 +0000 (16:42 -0500)]
Merge branch '8406-tempfail-after-retry-unlocked'

Closes #8406, #8407.

8 years ago8406: Update comment to match new code.
Brett Smith [Tue, 9 Feb 2016 21:42:12 +0000 (16:42 -0500)]
8406: Update comment to match new code.

8 years ago8406: @job_retry_counts.include? jobrecord.uuid because @job_retry_counts has a defau...
Peter Amstutz [Tue, 9 Feb 2016 21:25:45 +0000 (16:25 -0500)]
8406: @job_retry_counts.include? jobrecord.uuid because @job_retry_counts has a default value.

8 years ago8406: Treat EXIT_TEMPFAIL as EXIT_RETRY_UNLOCKED if we have previously gotten
Peter Amstutz [Tue, 9 Feb 2016 20:53:13 +0000 (15:53 -0500)]
8406: Treat EXIT_TEMPFAIL as EXIT_RETRY_UNLOCKED if we have previously gotten
EXIT_RETRY_UNLOCKED (because the job is now in "Running" state.)

8 years agoI "fonud" a typo
Nico Cesar [Tue, 9 Feb 2016 21:34:34 +0000 (16:34 -0500)]
I "fonud" a typo

no issue #

8 years agoMerge branch '8404-catch-interrupted-syscall' closes #8404
Peter Amstutz [Tue, 9 Feb 2016 17:32:42 +0000 (12:32 -0500)]
Merge branch '8404-catch-interrupted-syscall' closes #8404

8 years ago8404: Adjust try block to just surround os.wait().
Peter Amstutz [Tue, 9 Feb 2016 17:31:07 +0000 (12:31 -0500)]
8404: Adjust try block to just surround os.wait().

8 years ago8404: catch and continue from interrupted system call from os.wait()
Peter Amstutz [Tue, 9 Feb 2016 16:41:13 +0000 (11:41 -0500)]
8404: catch and continue from interrupted system call from os.wait()
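
The two 8404 commits describe a narrow try block around os.wait() that continues after an interrupted system call. A minimal sketch of that pattern follows; the function name `wait_retrying_eintr` and the injectable `wait` parameter are hypothetical, added here only so the loop can be exercised without real child processes. On Python 3.5 and later the interpreter already retries interrupted calls (PEP 475), so an explicit loop like this mainly matters on older interpreters.

```python
import errno
import os


def wait_retrying_eintr(wait=os.wait):
    """Call wait() until it completes without being interrupted by a signal."""
    while True:
        try:
            # Keep the try block as small as possible: only the wait call.
            return wait()
        except OSError as e:
            if e.errno != errno.EINTR:
                raise  # real errors (e.g. ECHILD) still propagate
            # EINTR: a signal arrived during the call; just try again.
```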

8 years agoFix nodemanager test race. No issue #
Tom Clegg [Mon, 8 Feb 2016 21:09:28 +0000 (16:09 -0500)]
Fix nodemanager test race. No issue #

8 years agoMerge branch '8341-live-crunchstat-summary' refs #8341
Tom Clegg [Mon, 8 Feb 2016 19:32:45 +0000 (14:32 -0500)]
Merge branch '8341-live-crunchstat-summary' refs #8341

8 years ago8341: Use a Queue of lines and one thread, instead of a succession of threads and...
Tom Clegg [Mon, 8 Feb 2016 19:18:42 +0000 (14:18 -0500)]
8341: Use a Queue of lines and one thread, instead of a succession of threads and a deque of buffers.
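
The "Queue of lines and one thread" design can be sketched roughly as below. This is not the code from reader.py; `iter_lines`, `fetch_pages` and the sentinel are hypothetical names, and the sketch assumes `fetch_pages()` returns an iterable of pages, each an iterable of lines.

```python
import queue
import threading

_EOF = object()  # sentinel marking the end of the stream


def iter_lines(fetch_pages, maxsize=1000):
    """Yield lines while a single worker thread fetches later pages."""
    q = queue.Queue(maxsize=maxsize)
    errors = []

    def worker():
        try:
            for page in fetch_pages():
                for line in page:
                    q.put(line)  # blocks when the consumer falls behind
        except Exception as e:
            errors.append(e)
        finally:
            q.put(_EOF)  # always wake the consumer, even after an error

    threading.Thread(target=worker, daemon=True).start()
    while True:
        item = q.get()
        if item is _EOF:
            break
        yield item
    if errors:
        raise errors[0]  # surface the worker's failure to the caller
```

Putting the sentinel in a `finally` clause means the consumer is not left waiting forever if the worker raises. The bounded queue limits how far ahead of the parser the fetcher can run.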

8 years ago8341: Move reader classes to reader.py.
Tom Clegg [Mon, 8 Feb 2016 01:19:45 +0000 (20:19 -0500)]
8341: Move reader classes to reader.py.

8 years ago8341: Use a worker thread to get page N+1 of logs while parsing page N.
Tom Clegg [Mon, 8 Feb 2016 01:15:00 +0000 (20:15 -0500)]
8341: Use a worker thread to get page N+1 of logs while parsing page N.

8 years ago8341: Get job log from logs API if the log has not been written to Keep yet.
Tom Clegg [Mon, 8 Feb 2016 00:43:02 +0000 (19:43 -0500)]
8341: Get job log from logs API if the log has not been written to Keep yet.

8 years agoMerge branch '8289-no-extra-orders' closes #8289
Tom Clegg [Mon, 8 Feb 2016 19:29:03 +0000 (14:29 -0500)]
Merge branch '8289-no-extra-orders' closes #8289

8 years ago8289: Strip redundant orders, even when provided explicitly by client.
Tom Clegg [Mon, 8 Feb 2016 19:28:02 +0000 (14:28 -0500)]
8289: Strip redundant orders, even when provided explicitly by client.

8 years ago8289: Do not add fallback orders if client already specified an unambiguous order.
Tom Clegg [Sat, 23 Jan 2016 05:23:49 +0000 (00:23 -0500)]
8289: Do not add fallback orders if client already specified an unambiguous order.

8 years agoMerge branch '7667-node-manager-logging' refs #7667
Peter Amstutz [Mon, 8 Feb 2016 16:28:53 +0000 (11:28 -0500)]
Merge branch '7667-node-manager-logging' refs #7667

8 years ago7667: Store node size in a table, to avoid blocking on booting and shutdown
Peter Amstutz [Mon, 8 Feb 2016 16:28:11 +0000 (11:28 -0500)]
7667: Store node size in a table, to avoid blocking on booting and shutdown
actors when asking for node size.

8 years ago7667: Fix log message
Peter Amstutz [Mon, 8 Feb 2016 03:52:51 +0000 (22:52 -0500)]
7667: Fix log message

8 years agoMerge branch '8285-fuse-subscribe-websockets' closes #8285
Tom Clegg [Sat, 6 Feb 2016 00:45:30 +0000 (19:45 -0500)]
Merge branch '8285-fuse-subscribe-websockets' closes #8285

8 years ago8285: Test that arvados.events.subscribe() is called only when needed.
Tom Clegg [Sat, 6 Feb 2016 00:39:42 +0000 (19:39 -0500)]
8285: Test that arvados.events.subscribe() is called only when needed.

Add missing TagsDirectory.want_event_subscribe().

8 years ago8285: Add test for listen_for_events
Peter Amstutz [Sat, 6 Feb 2016 00:17:42 +0000 (19:17 -0500)]
8285: Add test for listen_for_events

8 years ago8285: Add want_event_subscribe flag to subclasses of fusedir.Directory,
Peter Amstutz [Fri, 5 Feb 2016 21:39:25 +0000 (16:39 -0500)]
8285: Add want_event_subscribe flag to subclasses of fusedir.Directory,
determine whether to call listen_for_events based on it.

8 years ago7667: Combine polling logs into fewer lines for less noise. Adjust message
Peter Amstutz [Fri, 5 Feb 2016 16:10:43 +0000 (11:10 -0500)]
7667: Combine polling logs into fewer lines for less noise.  Adjust message
when last_ping_at is unexpectedly none to be less severe (can happen in
innocent circumstances).  Report nodes in "booted" list as "booting" since they
are unpaired.  Fix tests.

8 years ago7868: Update API server's arvados-cli version.
Brett Smith [Fri, 5 Feb 2016 09:52:43 +0000 (04:52 -0500)]
7868: Update API server's arvados-cli version.

Curoverse clusters are deployed by setting CRUNCH_JOB_BIN,
effectively excluding it from the bundle, but this is not true for
clusters deployed following the install guide.  Out of the box,
they'll use the version of crunch-job that's actually in the
arvados-cli gem in the bundle.

crunch-dispatch has functionality in it that requires a newer
arvados-cli, so update accordingly.  This is not exactly the version
produced by #7868, but it's pretty close.

I think there's a strong case that we should update this version
whenever we make a substantial change to crunch-job.  But since I'm
pushing this without discussion or review, I'm doing the smallest
thing possible.

Refs #7868.

8 years ago7667: Node manager bug fixes and logging improvements.
Peter Amstutz [Thu, 4 Feb 2016 23:46:31 +0000 (18:46 -0500)]
7667: Node manager bug fixes and logging improvements.

 * ComputeNodeSetupActor will now finish if there is an unhandled exception.

 * ComputeNodeMonitorActor now explains why a node that is in the shutdown window
is not eligible for shutdown.

 * Logging in nodes_wanted now distinguishes idle/busy/booting/shutting down.

 * Logging by actors is now class name and a portion of the actor urn, so actions
of a specific actor can be consistently identified.

8 years agoRecognize another way slurm tells us about node failures.
Tom Clegg [Thu, 4 Feb 2016 19:29:39 +0000 (14:29 -0500)]
Recognize another way slurm tells us about node failures.

Retry, instead of giving up, in situations like this:

2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Aborting, io error and missing step on node 0
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Timed out waiting for job step to complete
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 child 42984 on compute26.1 exit 0 success=
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 failure (#1, permanent) after 560 seconds
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 task output (0 bytes):

No issue #

8 years agoMerge branch '8288-poll-client-close-timeout' refs #8288
Tom Clegg [Thu, 4 Feb 2016 18:17:31 +0000 (13:17 -0500)]
Merge branch '8288-poll-client-close-timeout' refs #8288

8 years ago8288: Add timeout option to close() method of event clients.
Tom Clegg [Mon, 1 Feb 2016 06:58:34 +0000 (01:58 -0500)]
8288: Add timeout option to close() method of event clients.

Previously in EventClient, close() didn't wait for anything. Now, if a
timeout is given, it waits for ws4py to call the closed() callback to
indicate the connection has closed.

Previously in PollClient, close() waited indefinitely for the polling
thread to terminate.  This can take a very long time if, for example,
there are multiple subscriptions and the "get logs" API transaction is
slow.

The only apparent reason a caller would want to wait here at all is to
guarantee the simplifying assumption that the on_event() callback is never
called after close().  Now, instead of letting the thread run until
all events are received and handled, PollClient achieves this the same
way EventClient does: ignore events that arrive after close().
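
A minimal sketch of the "ignore events that arrive after close()" approach, under stated assumptions: the class `PollingClient`, its constructor arguments and its internals are hypothetical and do not reproduce the real PollClient or EventClient in the Arvados SDK.

```python
import threading


class PollingClient:
    """Hypothetical poller: close() never waits for a slow poll to finish."""

    def __init__(self, poll, on_event):
        self._poll = poll          # callable returning a list of events
        self._on_event = on_event  # user callback
        self._closed = False
        self._lock = threading.RLock()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        while not self._closed:
            for event in self._poll():
                with self._lock:
                    if self._closed:
                        return  # drop events that arrive after close()
                    self._on_event(event)

    def close(self, timeout=None):
        with self._lock:
            self._closed = True
        # Optionally give the thread a bounded time to exit; never block
        # indefinitely, and never try to join from the polling thread itself.
        if timeout and threading.current_thread() is not self._thread:
            self._thread.join(timeout)
```

Setting the flag under the same lock that guards the callback is one way to ensure the callback is not started after close() returns, without waiting for an in-flight poll to complete.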

8 years agoMake install guide slurm.conf more Arvados-compliant.
Brett Smith [Thu, 4 Feb 2016 10:33:24 +0000 (05:33 -0500)]
Make install guide slurm.conf more Arvados-compliant.

* SelectType=select/linear allocates entire nodes at a time.  The
  previous value scheduled individual cores.
* With that change, SelectTypeParameters=CR_CPU_Memory is not valid.
  Remove it, as we do in production.
* The setting of FastSchedule seems less pressing, but 0 is what we
  use in production, so share that here too.

No issue #.

8 years agoTry to make logging identify the actor consistently
Peter Amstutz [Wed, 3 Feb 2016 22:51:46 +0000 (17:51 -0500)]
Try to make logging identify the actor consistently

8 years agoMerge branch '6702-gce-node-create-fix' closes #6702
Peter Amstutz [Wed, 3 Feb 2016 20:54:18 +0000 (15:54 -0500)]
Merge branch '6702-gce-node-create-fix' closes #6702

8 years agoMerge branch '8288-arv-mount-deadlock' refs #8288
Tom Clegg [Wed, 3 Feb 2016 17:51:54 +0000 (12:51 -0500)]
Merge branch '8288-arv-mount-deadlock' refs #8288

8 years ago8288: Do not call operations.destroy() as a last resort, just abandon the llfuse...
Tom Clegg [Tue, 2 Feb 2016 21:46:35 +0000 (16:46 -0500)]
8288: Do not call operations.destroy() as a last resort, just abandon the llfuse thread.

8 years ago8288: Add test case for --exec mode.
Tom Clegg [Mon, 1 Feb 2016 08:01:31 +0000 (03:01 -0500)]
8288: Add test case for --exec mode.

8 years ago8288: Give fusermount -u a chance to work before resorting to operations.destroy().
Tom Clegg [Mon, 1 Feb 2016 02:43:30 +0000 (21:43 -0500)]
8288: Give fusermount -u a chance to work before resorting to operations.destroy().

Log a warning when resorting to operations.destroy().

De-duplicate setup/teardown code so more of the --exec code path is exercised in tests.

8 years ago8123: Install chartjs.js asset file.
Tom Clegg [Wed, 3 Feb 2016 17:50:31 +0000 (12:50 -0500)]
8123: Install chartjs.js asset file.

...during "setup.py install" too, not just when installing via
package.

refs #8123

8 years agoImprove install guide Nginx+SCL integration.
Brett Smith [Wed, 3 Feb 2016 11:42:17 +0000 (06:42 -0500)]
Improve install guide Nginx+SCL integration.

No issue #.

8 years agologin-sync gets user's home from /etc/passwd.
Brett Smith [Wed, 3 Feb 2016 11:26:32 +0000 (06:26 -0500)]
login-sync gets user's home from /etc/passwd.

No issue #.

8 years agoWorkbench loads CA certs on Red Hat.
Brett Smith [Wed, 3 Feb 2016 10:37:42 +0000 (05:37 -0500)]
Workbench loads CA certs on Red Hat.

This has the same rationale and logic as #6432 and
9b910084faf3db6fa2071af604620e7d45d12a6c, applied to Workbench.

Changing from `/etc/ssl/certs` to `/etc/ssl/certs/ca-certificates.crt`
is safe, because add_trust_ca accepts either a directory with hashed
certs, or a file with multiple certs.  On Debian, the latter path is a
single file built from the hashed certs in the former, so this is
functionally identical there, and more predictable on Red Hat (where I
don't know what it's doing).

No issue #.

8 years agoAdd fuse dependency to FUSE driver package.
Brett Smith [Wed, 3 Feb 2016 09:53:04 +0000 (04:53 -0500)]
Add fuse dependency to FUSE driver package.

When the fuse tools aren't installed, attempting to run arv-mount
fails with "fuse: failed to exec fusermount".

No issue #.

8 years agoAdd curl library dependency to shell install guide.
Brett Smith [Wed, 3 Feb 2016 09:39:27 +0000 (04:39 -0500)]
Add curl library dependency to shell install guide.

No issue #.

8 years agoSLURM install guide notes slurm.conf path on Red Hat.
Brett Smith [Wed, 3 Feb 2016 09:32:39 +0000 (04:32 -0500)]
SLURM install guide notes slurm.conf path on Red Hat.

No issue #.

8 years agoAdd missing ; in keepproxy Nginx config.
Brett Smith [Wed, 3 Feb 2016 09:26:49 +0000 (04:26 -0500)]
Add missing ; in keepproxy Nginx config.

No issue #.

8 years ago6702: Refactor create_node to BaseComputeNodeDriver so logic also applies to
Peter Amstutz [Tue, 2 Feb 2016 17:26:57 +0000 (12:26 -0500)]
6702: Refactor create_node to BaseComputeNodeDriver so logic also applies to
Azure.  Adds new find_node() method; if it returns None or raises an error,
re-raise the original create_node exception.
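
The fallback described above can be sketched as follows. This is not the BaseComputeNodeDriver code; `create_node_checked` and its callable parameters are hypothetical stand-ins for the driver's create and find operations.

```python
def create_node_checked(create_node, find_node, name):
    """Create a node; if creation fails, check whether it exists anyway."""
    try:
        return create_node(name)
    except Exception as create_error:
        try:
            node = find_node(name)
        except Exception:
            node = None  # a failed lookup counts as "not found"
        if node is None:
            # Nothing was created: report the original failure, not the
            # lookup's outcome.
            raise create_error
        return node
```

Re-raising the original exception keeps the error the caller sees tied to the create call, which is usually the more useful one for diagnosis.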

8 years agoMerge branch '6702-gce-node-create-fix' closes #6702
Peter Amstutz [Tue, 2 Feb 2016 16:31:15 +0000 (11:31 -0500)]
Merge branch '6702-gce-node-create-fix' closes #6702

8 years agoMerge branch 'fix/build-python-llfuse-version' of https://github.com/wtsi-hgi/arvados
Peter Amstutz [Tue, 2 Feb 2016 16:05:50 +0000 (11:05 -0500)]
Merge branch 'fix/build-python-llfuse-version' of https://github.com/wtsi-hgi/arvados
no issue #

8 years agoMerge branch 'master' into 6702-gce-node-create-fix
Peter Amstutz [Tue, 2 Feb 2016 15:56:13 +0000 (10:56 -0500)]
Merge branch 'master' into 6702-gce-node-create-fix

8 years agoMerge branch '8206-gce-retry-init' closes #8206
Peter Amstutz [Tue, 2 Feb 2016 15:55:58 +0000 (10:55 -0500)]
Merge branch '8206-gce-retry-init' closes #8206

8 years ago8206: Mock time.sleep() to avoid unnecessary delay in test.
Peter Amstutz [Tue, 2 Feb 2016 15:55:39 +0000 (10:55 -0500)]
8206: Mock time.sleep() to avoid unnecessary delay in test.

8 years agopins python-llfuse version to 0.41.1 for fpm on all platforms
Joshua Randall [Tue, 2 Feb 2016 15:45:46 +0000 (15:45 +0000)]
pins python-llfuse version to 0.41.1 for fpm on all platforms

8 years ago8206: Refactor _retry to RetryMixin. Make retry timing consistent.
Peter Amstutz [Tue, 2 Feb 2016 15:03:39 +0000 (10:03 -0500)]
8206: Refactor _retry to RetryMixin.  Make retry timing consistent.

8 years ago8005: Install guide suggests slurm-munge on Red Hat SLURM nodes.
Brett Smith [Tue, 2 Feb 2016 12:23:10 +0000 (07:23 -0500)]
8005: Install guide suggests slurm-munge on Red Hat SLURM nodes.

This package includes the SLURM plugins that talk to MUNGE.
Refs #8005.

8 years ago6702: Catch GCE create_node() errors and check if the node was actually
Peter Amstutz [Mon, 1 Feb 2016 19:54:28 +0000 (14:54 -0500)]
6702: Catch GCE create_node() errors and check if the node was actually
created.  Added test.

8 years ago8014: Remove more upgrade script references from install guide.
Brett Smith [Mon, 1 Feb 2016 17:43:04 +0000 (12:43 -0500)]
8014: Remove more upgrade script references from install guide.

The steps removed are now handled by Rails package postinst scripts.
This should've been done in 378a988bbf9e29736382339f587582259b641782,
but was overlooked.  Refs #8014.

8 years agoRefresh Gitolite install guide.
Brett Smith [Mon, 1 Feb 2016 16:53:29 +0000 (11:53 -0500)]
Refresh Gitolite install guide.

* Tested instructions still work with 3.6.4.  So noted.
* Prefer cloning Gitolite over HTTPS, since that's less likely to be
  firewalled.

No issue #.

8 years agoFix install doc rendering of API Nginx config.
Brett Smith [Mon, 1 Feb 2016 16:51:14 +0000 (11:51 -0500)]
Fix install doc rendering of API Nginx config.

<notextile> doesn't actually nest like proper HTML, it's just a
boolean that remembers the last state.  Turn it back on after doing an
include that turns it off.  No issue #.

8 years agoPin llfuse to 0.41.1 because 0.42 came out and broke things. no issue #
Peter Amstutz [Mon, 1 Feb 2016 14:14:41 +0000 (09:14 -0500)]
Pin llfuse to 0.41.1 because 0.42 came out and broke things.  no issue #

8 years agoMerge branch '8005-centos-3rdparty-installs-wip'
Brett Smith [Fri, 29 Jan 2016 00:38:04 +0000 (19:38 -0500)]
Merge branch '8005-centos-3rdparty-installs-wip'

Closes #8005, #8135.

8 years ago8005: Add tar Ruby build dependency on CentOS 6.
Brett Smith [Fri, 29 Jan 2016 00:27:13 +0000 (19:27 -0500)]
8005: Add tar Ruby build dependency on CentOS 6.

8 years ago8005: Install guide uses runit packages on Red Hat.
Brett Smith [Thu, 28 Jan 2016 00:02:05 +0000 (19:02 -0500)]
8005: Install guide uses runit packages on Red Hat.

The runit RPMs only provide /etc/service.  The .debs provide /etc/sv
and /etc/service.  Our understanding is that /etc/sv is for all
service definitions (akin to /etc/init.d), and /etc/service is for
service definitions that runit should start at boot (akin to
/etc/rcN.d).  To provide uniformity, our install guide instructs users
to make /etc/sv if needed, and link it to /etc/service.

This commit could go farther.  Today it would be best if all the runit
sections in the install guide followed Tom's modern template used for
arv-git-httpd and arvados-docker-cleaner.  However, that will probably
require some creation and testing of log/run scripts, and some
adaptation of the run scripts to fit the template.  I wish I could
include those improvements now, but unfortunately I'm out of time, so
they'll have to wait for another day.

8 years ago8005: Install guide gets SLURM and MUNGE from RPMs.
Brett Smith [Thu, 28 Jan 2016 00:08:33 +0000 (19:08 -0500)]
8005: Install guide gets SLURM and MUNGE from RPMs.

8 years ago8005: Fix bad Textile markup in compute node install guide.
Brett Smith [Wed, 27 Jan 2016 23:54:57 +0000 (18:54 -0500)]
8005: Fix bad Textile markup in compute node install guide.

The switch dashes created strikethrough for much of the notebox.

8 years ago8005: Document installing Git on CentOS 6 from RepoForge.
Brett Smith [Wed, 27 Jan 2016 20:15:23 +0000 (15:15 -0500)]
8005: Document installing Git on CentOS 6 from RepoForge.

8 years ago8005: DRY up PostgreSQL password auth instructions on CentOS 6.
Brett Smith [Wed, 27 Jan 2016 20:00:17 +0000 (15:00 -0500)]
8005: DRY up PostgreSQL password auth instructions on CentOS 6.

8 years agoMake our API server packages for debian-based distributions depend on
Ward Vandewege [Thu, 28 Jan 2016 19:32:00 +0000 (14:32 -0500)]
Make our API server packages for debian-based distributions depend on
libcurl-ssl-dev rather than libcurl4-openssl-dev.

No issue #

8 years agocloses #8198
radhika [Tue, 26 Jan 2016 17:10:04 +0000 (12:10 -0500)]
closes #8198
Merge branch '8198-node-ip-address'

8 years agoMerge branch 'master' into 8198-node-ip-address
radhika [Tue, 26 Jan 2016 17:09:37 +0000 (12:09 -0500)]
Merge branch 'master' into 8198-node-ip-address

8 years agorefs #8178
radhika [Tue, 26 Jan 2016 17:08:21 +0000 (12:08 -0500)]
refs #8178
Merge branch '8178-keepstore-trash-interface'

8 years agoMerge branch '8178-keepstore-trash-interface' of git.curoverse.com:arvados into 8178...
radhika [Tue, 26 Jan 2016 15:41:00 +0000 (10:41 -0500)]
Merge branch '8178-keepstore-trash-interface' of git.curoverse.com:arvados into 8178-keepstore-trash-interface

Conflicts:
services/keepstore/handlers.go
services/keepstore/volume_test.go

8 years ago8178: untrash should fail when ErrNotImplemented is returned.
radhika [Tue, 26 Jan 2016 15:38:28 +0000 (10:38 -0500)]
8178: untrash should fail when ErrNotImplemented is returned.

8 years ago8178: (for now) all volumes must return ErrNotImplemented if trash-lifetime != 0
radhika [Fri, 22 Jan 2016 22:37:15 +0000 (17:37 -0500)]
8178: (for now) all volumes must return ErrNotImplemented if trash-lifetime != 0

8 years ago8178: All three currently supported volumes return error when trash-lifetime period...
radhika [Thu, 21 Jan 2016 20:25:06 +0000 (15:25 -0500)]
8178: All three currently supported volumes return error when trash-lifetime period is not configured. Azure blob and S3 volumes are updated to do so.
Returning an error is causing test failures in unix volume and hence is still a work in progress.

8 years ago8178: rename Delete api as Trash; add Untrash to volume interface; add UndeleteHandle...
radhika [Thu, 21 Jan 2016 18:59:36 +0000 (13:59 -0500)]
8178: rename Delete api as Trash; add Untrash to volume interface; add UndeleteHandler and test for this endpoint.

8 years ago8206: Add test to support retry on create_driver.
Peter Amstutz [Mon, 25 Jan 2016 22:02:40 +0000 (17:02 -0500)]
8206: Add test to support retry on create_driver.

8 years agoMerge branch '8123-crunchstat-graphs' closes #8123
Tom Clegg [Mon, 25 Jan 2016 21:08:14 +0000 (16:08 -0500)]
Merge branch '8123-crunchstat-graphs' closes #8123

8 years ago8123: Escape HTML chars in page title.
Tom Clegg [Mon, 25 Jan 2016 21:05:56 +0000 (16:05 -0500)]
8123: Escape HTML chars in page title.

8 years ago8206: Refactor _retry into common function wrapper usable by both dispatch and
Peter Amstutz [Mon, 25 Jan 2016 20:36:34 +0000 (15:36 -0500)]
8206: Refactor _retry into common function wrapper usable by both dispatch and
compute drivers.

8 years ago8123: Explain existing_constraints and use a proper instance variable.
Tom Clegg [Mon, 25 Jan 2016 06:16:44 +0000 (01:16 -0500)]
8123: Explain existing_constraints and use a proper instance variable.

8 years ago8123: Fix accidental old-style class.
Tom Clegg [Mon, 25 Jan 2016 06:08:27 +0000 (01:08 -0500)]
8123: Fix accidental old-style class.

8 years ago8123: Fix type check to accommodate unicode.
Tom Clegg [Mon, 25 Jan 2016 06:00:03 +0000 (01:00 -0500)]
8123: Fix type check to accommodate unicode.

8 years ago8123: Use -v,-vv instead of --verbose,--debug.
Tom Clegg [Mon, 25 Jan 2016 05:59:46 +0000 (00:59 -0500)]
8123: Use -v,-vv instead of --verbose,--debug.

8 years ago8123: Change --include-child-jobs to --skip-child-jobs (default False).
Tom Clegg [Mon, 25 Jan 2016 02:07:42 +0000 (21:07 -0500)]
8123: Change --include-child-jobs to --skip-child-jobs (default False).

8 years ago8123: Explain mysterious memory constraint logic.
Tom Clegg [Mon, 25 Jan 2016 02:06:48 +0000 (21:06 -0500)]
8123: Explain mysterious memory constraint logic.

8 years ago8123: Update test dependencies.
Tom Clegg [Mon, 25 Jan 2016 02:05:28 +0000 (21:05 -0500)]
8123: Update test dependencies.

8 years ago8284: Fix confusion between %proc and %jobstep.
Tom Clegg [Mon, 25 Jan 2016 00:48:06 +0000 (19:48 -0500)]
8284: Fix confusion between %proc and %jobstep.

$proc{$pid}->{jobstep} is an index into @jobstep
$proc{$pid}->{jobstepname} is the name we told srun to use
$proc{$pid}->{killtime} is a deadline when we should kill the process
$jobstep[$jobstepid]->{stderr_at} is the time of last stderr received

We were mistakenly using $proc->{$pid}->{stderr_at}, which was always
undef and therefore always less than $last_squeue_check. This resulted
in jobs being killed as "slurm orphans" when the real reason they
hadn't been returned by waitpid() was that we hadn't finished
consuming their stderr yet.