Peter Amstutz [Thu, 2 Jun 2016 21:59:20 +0000 (17:59 -0400)]
9187: Improve squeue synchronization
* Put squeue functions into separate file.
* CheckSqueue() now blocks on a condition variable until the next successful
update of squeue, which then wakes up all goroutines waiting on CheckSqueue().
* Never do anything when squeue returns an error.
* Merge submitting, monitoring, and cleanup behaviors into a single goroutine
which updates based on CheckSqueue() instead of a ticker.
* Introduce a lock on squeue, sbatch and scancel operations, so that on next
wakeup the queue is guaranteed to reflect most recent sbatch/scancel
operations.
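The blocking behavior described above can be sketched roughly as follows. This is a minimal illustration, not the dispatcher's actual code: the `Squeue` type, its fields, and `Update` are hypothetical names, and a real poller would run squeue on a timer rather than in a loop in `main`. The sbatch and scancel calls (not shown) are assumed to take the same mutex, so the next successful update reflects them.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// errSqueue stands in for a failed squeue invocation (hypothetical).
var errSqueue = errors.New("squeue failed")

// Squeue caches the most recent successful queue listing.
// All names here are illustrative assumptions.
type Squeue struct {
	mu       sync.Mutex
	cond     *sync.Cond
	contents map[string]bool
	version  int // incremented on every successful update
}

func NewSqueue() *Squeue {
	s := &Squeue{contents: map[string]bool{}}
	s.cond = sync.NewCond(&s.mu)
	return s
}

// Update records the result of one squeue run. On error the cache is
// left untouched and no waiter is woken.
func (s *Squeue) Update(jobs []string, err error) {
	if err != nil {
		return
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	s.contents = make(map[string]bool, len(jobs))
	for _, j := range jobs {
		s.contents[j] = true
	}
	s.version++
	s.cond.Broadcast() // wake every goroutine blocked in CheckSqueue
}

// CheckSqueue blocks until the next successful update that happens after
// the call, then reports whether the job is in the queue.
func (s *Squeue) CheckSqueue(job string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	seen := s.version
	for s.version == seen {
		s.cond.Wait()
	}
	return s.contents[job]
}

func main() {
	s := NewSqueue()
	result := make(chan bool)
	go func() { result <- s.CheckSqueue("job1") }()
	for {
		s.Update(nil, errSqueue)        // ignored: nothing changes
		s.Update([]string{"job1"}, nil) // wakes all waiters
		select {
		case inQueue := <-result:
			fmt.Println("job1 in queue:", inQueue)
			return
		case <-time.After(10 * time.Millisecond):
		}
	}
}
```

Comparing a version counter inside the wait loop guards against spurious wakeups and ensures a caller never returns data older than its own call.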
Peter Amstutz [Wed, 1 Jun 2016 20:06:26 +0000 (16:06 -0400)]
9187: Slurm dispatcher improvements around squeue
* Clarify that status updates are not guaranteed to be delivered on a
heartbeat.
* Refactor slurm dispatcher to monitor the container in squeue in a separate
goroutine.
* Refactor polling squeue to a single goroutine and cache the results so that
monitoring 100 containers doesn't result in 100 calls to squeue.
* No longer set up strigger to cancel job on finish; instead, cancel running
jobs not in squeue.
* Test both cases where a job is/is not in squeue.
Peter Amstutz [Thu, 19 May 2016 18:12:42 +0000 (14:12 -0400)]
9187: Refactor dispatcher support into common library and update to use Locking API.
New dispatcher package in Go SDK provides framework for monitoring list of
queued/locked/running containers. Try to lock containers in the queue; locked
or running containers are passed to RunContainer goroutine supplied by the
specific dispatcher. Refactor existing dispatchers (-local and -slurm) to use
this framework. Dispatchers have crash recovery behavior: containers that are
unaccounted for can be put into the cancelled state.
Peter Amstutz [Thu, 26 May 2016 13:51:24 +0000 (09:51 -0400)]
9303: Fetch arv_node before trying to shut down node, because monitor actor may
go away once the node has been successfully shut down. Also handle case of
node_finished_shutdown called after shutdown actor is stopped.
Peter Amstutz [Wed, 18 May 2016 20:33:57 +0000 (16:33 -0400)]
8653: cwl-runner crunch script rewrites keep file paths into CWL File objects.
Clean up argument handling in arvados-cwl-runner so that --create-template
doesn't require a job object, and that --help doesn't present options that are
irrelevant or don't work.
Peter Amstutz [Tue, 17 May 2016 20:59:20 +0000 (16:59 -0400)]
8236: Restore os.killpg(). Create a new process group so that it won't kill
the parent process by accident. Watchdog process now only monitors specific
actors.
Peter Amstutz [Mon, 16 May 2016 14:29:50 +0000 (10:29 -0400)]
9161: Decisions to start and stop compute nodes are now based on an explicit
set of states: booting, unpaired, idle, busy, down, shutdown. Refactor to
remove 'shutdowns' dict and fold into cloud_nodes. Nodes_wanted uses same
computation of node state as used for decision to shut down nodes. Nodes for
which the state is unclear are either idle (if in the boot grace period) or
down (if older).
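One way to picture the explicit state set is a single classification function shared by the start and stop decisions. The sketch below is an illustration in Go only and is not the node manager's code: the `Node` fields, the SLURM state strings, the order of the checks, and the 10-minute `bootGrace` value are all assumptions. The six state names and the idle-versus-down rule for unclear nodes come from the message above.

```go
package main

import (
	"fmt"
	"time"
)

// State is one of the six explicit node states named in the commit message.
type State string

const (
	Booting  State = "booting"
	Unpaired State = "unpaired"
	Idle     State = "idle"
	Busy     State = "busy"
	Down     State = "down"
	Shutdown State = "shutdown"
)

// bootGrace is an assumed grace period; the real value is configuration.
const bootGrace = 10 * time.Minute

// Node is a hypothetical summary of what is known about one cloud node.
type Node struct {
	Booting        bool          // created, not yet in the cloud node list
	ShuttingDown   bool          // a shutdown is in progress
	Paired         bool          // matched with an Arvados node record
	SlurmState     string        // e.g. "idle", "alloc", "down" (assumed values)
	SinceFirstPing time.Duration // time elapsed since the first ping
}

// Classify maps a node to exactly one state. Nodes whose state is unclear
// count as idle inside the boot grace period and down afterwards.
func Classify(n Node) State {
	switch {
	case n.ShuttingDown:
		return Shutdown
	case n.Booting:
		return Booting
	case !n.Paired:
		return Unpaired
	}
	switch n.SlurmState {
	case "idle":
		return Idle
	case "alloc":
		return Busy
	}
	if n.SinceFirstPing < bootGrace {
		return Idle
	}
	return Down
}

func main() {
	fresh := Node{Paired: true, SlurmState: "down", SinceFirstPing: time.Minute}
	stale := Node{Paired: true, SlurmState: "down", SinceFirstPing: time.Hour}
	fmt.Println(Classify(fresh), Classify(stale))
}
```

Using one function for both decisions means the count of nodes wanted and the choice of nodes to shut down cannot disagree about what state a node is in.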
Peter Amstutz [Fri, 13 May 2016 20:09:10 +0000 (16:09 -0400)]
9161: Add _nodemanager_recently_booted as a new way of remembering nodes which
are in an intermediate state between being created and showing up in the cloud
node list.
Brett Smith [Thu, 12 May 2016 20:40:37 +0000 (16:40 -0400)]
9213: Improve gem loading in `arv`.
* Include the exception string in the error message.
* Separate stdlib loading problems from gem loading problems.
* Load gems with more dependencies first, to avoid situations like
this:
irb(main):001:0> require 'active_support/inflector'
=> true
irb(main):002:0> require 'arvados/google_api_client'
Gem::LoadError: Unable to activate arvados-0.1.20160420143004, because activesupport-4.2.6 conflicts with activesupport (< 4.2.6, >= 3)
Brett Smith [Mon, 9 May 2016 16:54:23 +0000 (12:54 -0400)]
9135: Bring EventClient's public interface closer to PollClient's.
* Restore the run_forever method, which was previously inherited from
WebSocketClient.
* Remove the connect and close_connection methods, which are
WebSocketClient implementation details that don't make sense as part
of the public interface. (A running EventClient will just reconnect
if you call close_connection on it.)
Brett Smith [Mon, 9 May 2016 16:57:42 +0000 (12:57 -0400)]
9135: Make EventClient initialization more consistent.
* DRY up the setup code. This includes always trying to close the
connection after failure, since we were doing that in the initial
connection.
* Make the client a daemon thread, for consistency with PollClient.
Peter Amstutz [Fri, 13 May 2016 14:11:39 +0000 (10:11 -0400)]
9161: Eliminate 'booted' list and put nodes directly into cloud_nodes list.
Refactor logic for registering cloud nodes. Refactor computation of nodes
wanted; explicitly model 'unpaired' and 'down'.
Peter Amstutz [Wed, 11 May 2016 20:55:00 +0000 (16:55 -0400)]
9161: There's a window between when a node pings for the first time and when
the value of 'slurm_state' is synchronized by crunch-dispatch. In this window,
the node will still report as 'down'. Check first_ping_at and implement a grace
period during which the node will be considered 'idle'.
Add configuration parameter 'async_permissions_update' (default false). If
true, do not delete permission cache in #invalidate_permissions_cache, but
instead trigger "NOTIFY invalidate_permissions_cache" on the database.
Add script/permission-updater.rb which runs as an independent process. It
blocks on "LISTEN invalidate_permissions_cache" and updates the permission
cache whenever notified.
This is not ready for use; in particular it creates a race condition
recomputing permissions with effects such as not being able to read back API
records that were just created.
Tom Clegg [Fri, 29 Apr 2016 16:55:24 +0000 (12:55 -0400)]
9068: Move buffer allocation from volumes to GetBlockHandler.
This makes the Volume interface more idiomatic: Get() accepts a buffer
to read into, and returns a number of bytes read, much like the Read()
method of an io.Reader.
It also makes it possible for GetBlockHandler to notice, while waiting
for a buffer, that the client has disconnected: In this case, it
releases the network socket and never asks any volumes to do any work.
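The shape of this change can be sketched as below. It is a simplified illustration rather than the keepstore code: `memVolume`, `getBlock`, the buffer pool channel, and the `done` channel used to signal a client disconnect are hypothetical stand-ins, and the real handler's signature and disconnect detection may differ.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	errClientGone  = errors.New("client disconnected")
	errNotFound    = errors.New("block not found")
	errBufTooSmall = errors.New("buffer too small")
)

// Volume reads a block into a caller-supplied buffer and returns the number
// of bytes read, much like io.Reader's Read.
type Volume interface {
	Get(loc string, buf []byte) (int, error)
}

// memVolume is a hypothetical in-memory volume that counts Get calls.
type memVolume struct {
	blocks map[string][]byte
	calls  int
}

func (v *memVolume) Get(loc string, buf []byte) (int, error) {
	v.calls++
	data, ok := v.blocks[loc]
	if !ok {
		return 0, errNotFound
	}
	if len(data) > len(buf) {
		return 0, errBufTooSmall
	}
	return copy(buf, data), nil
}

// getBlock waits for a free buffer or for the client to go away, whichever
// happens first. If the client leaves first, no volume is asked to do work.
func getBlock(done <-chan struct{}, pool chan []byte, vols []Volume, loc string) ([]byte, error) {
	var buf []byte
	select {
	case buf = <-pool:
	case <-done:
		return nil, errClientGone
	}
	defer func() { pool <- buf }() // return the buffer to the pool
	for _, v := range vols {
		if n, err := v.Get(loc, buf); err == nil {
			out := make([]byte, n)
			copy(out, buf[:n])
			return out, nil
		}
	}
	return nil, errNotFound
}

func main() {
	pool := make(chan []byte, 1)
	pool <- make([]byte, 64)
	vol := &memVolume{blocks: map[string][]byte{"abc": []byte("hello")}}
	data, err := getBlock(make(chan struct{}), pool, []Volume{vol}, "abc")
	fmt.Println(string(data), err)
}
```

Because the handler owns the buffer, the wait for memory and the check for a vanished client happen in one place, before any volume is touched.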