Peter Amstutz [Wed, 18 May 2016 20:33:57 +0000 (16:33 -0400)]
8653: cwl-runner crunch script rewrites keep file paths into CWL File objects.
Clean up argument handling in arvados-cwl-runner so that --create-template
doesn't require a job object, and that --help doesn't present options that are
irrelevant or don't work.
Peter Amstutz [Tue, 17 May 2016 20:59:20 +0000 (16:59 -0400)]
8236: Restore os.killpg(). Create a new process group so that it won't kill
the parent process by accident. Watchdog process now only monitors specific
actors.
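
A minimal Python sketch of the process-group idea in 8236, not the actual change. It assumes a POSIX system; the helper names spawn_in_own_group and kill_group are hypothetical, and start_new_session (which calls setsid() in the child) is just one way to give the child its own process group.

```python
import os
import signal
import subprocess
import sys


def spawn_in_own_group(argv):
    # start_new_session=True makes the child call setsid(), so it becomes
    # the leader of a new session and a new process group, separate from ours.
    return subprocess.Popen(argv, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM):
    pgid = os.getpgid(proc.pid)
    # Guard against the accident described above: never signal our own group.
    if pgid == os.getpgrp():
        raise RuntimeError("child shares our process group; refusing to killpg")
    os.killpg(pgid, sig)


# Hypothetical usage: start a long-running child, then kill its whole group.
child = spawn_in_own_group([sys.executable, "-c", "import time; time.sleep(60)"])
child_pgid = os.getpgid(child.pid)
kill_group(child)
exit_code = child.wait()
```

Because the child leads its own group, os.killpg() reaches the child and anything it forked, while the parent stays outside the signalled group.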
Peter Amstutz [Mon, 16 May 2016 14:29:50 +0000 (10:29 -0400)]
9161: Decisions to start and stop compute nodes are now based on an explicit
set of states: booting, unpaired, idle, busy, down, shutdown. Refactor to
remove 'shutdowns' dict and fold into cloud_nodes. Nodes_wanted uses the same
computation of node state as is used for the decision to shut down nodes. Nodes for
which the state is unclear are either idle (if in the boot grace period) or
down (if older).
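
A simplified sketch of the explicit state model described in 9161, under stated assumptions. The field names (shutting_down, in_cloud_list, arvados_record, slurm_state, first_ping_at), the function names, and the 300-second grace period are all hypothetical and chosen for illustration; the real node manager code is organized differently.

```python
BOOT_GRACE_SECONDS = 300  # hypothetical grace period


def node_state(node, now, boot_grace=BOOT_GRACE_SECONDS):
    """Return one of: booting, unpaired, idle, busy, down, shutdown."""
    if node.get("shutting_down"):
        return "shutdown"
    if not node.get("in_cloud_list"):
        # Created, but not yet visible in the cloud node list.
        return "booting"
    record = node.get("arvados_record")
    if record is None:
        # Visible in the cloud, but not yet matched to an Arvados node record.
        return "unpaired"
    reported = record.get("slurm_state")
    if reported in ("idle", "busy"):
        return reported
    # State is unclear: idle inside the boot grace period, down if older.
    first_ping_at = record.get("first_ping_at")
    if first_ping_at is not None and now - first_ping_at < boot_grace:
        return "idle"
    return "down"


def count_states(nodes, now):
    # One computation of state, shared by every decision that needs it.
    counts = {}
    for node in nodes:
        state = node_state(node, now)
        counts[state] = counts.get(state, 0) + 1
    return counts


now = 1000
nodes = [
    {"in_cloud_list": False},
    {"in_cloud_list": True, "arvados_record": None},
    {"in_cloud_list": True,
     "arvados_record": {"slurm_state": "down", "first_ping_at": 900}},
    {"in_cloud_list": True,
     "arvados_record": {"slurm_state": "down", "first_ping_at": 100}},
    {"in_cloud_list": True, "arvados_record": {"slurm_state": "busy"}},
    {"in_cloud_list": True, "shutting_down": True},
]
counts = count_states(nodes, now)
```

The point of the sketch is that start and stop decisions read the same node_state() result, so the two cannot disagree about what a node is doing.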
Peter Amstutz [Fri, 13 May 2016 20:09:10 +0000 (16:09 -0400)]
9161: Add _nodemanager_recently_booted as a new way of remembering nodes which are in an intermediate state between being created and showing up in the cloud node list.
Brett Smith [Thu, 12 May 2016 20:40:37 +0000 (16:40 -0400)]
9213: Improve gem loading in `arv`.
* Include the exception string in the error message.
* Separate stdlib loading problems from gem loading problems.
* Load gems with more dependencies first, to avoid situations like
this:
irb(main):001:0> require 'active_support/inflector'
=> true
irb(main):002:0> require 'arvados/google_api_client'
Gem::LoadError: Unable to activate arvados-0.1.20160420143004, because activesupport-4.2.6 conflicts with activesupport (< 4.2.6, >= 3)
Brett Smith [Mon, 9 May 2016 16:54:23 +0000 (12:54 -0400)]
9135: Bring EventClient's public interface closer to PollClient's.
* Restore the run_forever method, which was previously inherited from
WebSocketClient.
* Remove the connect and close_connection methods, which are
WebSocketClient implementation details that don't make sense as part
of the public interface. (A running EventClient will just reconnect
if you call close_connection on it.)
Brett Smith [Mon, 9 May 2016 16:57:42 +0000 (12:57 -0400)]
9135: Make EventClient initialization more consistent.
* DRY up the setup code. This includes always trying to close the
connection after failure, since we were doing that in the initial
connection.
* Make the client a daemon thread, for consistency with PollClient.
Peter Amstutz [Fri, 13 May 2016 14:11:39 +0000 (10:11 -0400)]
9161: Eliminate 'booted' list and put nodes directly into cloud_nodes list.
Refactor logic for registering cloud nodes. Refactor computation of nodes
wanted; explicitly model 'unpaired' and 'down'.
Peter Amstutz [Wed, 11 May 2016 20:55:00 +0000 (16:55 -0400)]
9161: There's a window between when a node pings for the first time and the
value of 'slurm_state' is synchronized by crunch-dispatch. In this window, the
node will still report as 'down'. Check first_ping_at and implement a grace
period where the node will be considered 'idle'.
Add configuration parameter 'async_permissions_update' (default false). If
true, do not delete permission cache in #invalidate_permissions_cache, but
instead trigger "NOTIFY invalidate_permissions_cache" on the database.
Add script/permission-updater.rb which runs as an independent process. It
blocks on "LISTEN invalidate_permissions_cache" and updates the permission
cache whenever notified.
This is not ready for use; in particular it creates a race condition
recomputing permissions with effects such as not being able to read back API
records that were just created.
Tom Clegg [Fri, 29 Apr 2016 16:55:24 +0000 (12:55 -0400)]
9068: Move buffer allocation from volumes to GetBlockHandler.
This makes the Volume interface more idiomatic: Get() accepts a buffer
to read into, and returns a number of bytes read, much like the Read()
method of an io.Reader.
It also makes it possible for GetBlockHandler to notice, while waiting
for a buffer, that the client has disconnected: In this case, it
releases the network socket and never asks any volumes to do any work.
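
The actual change in 9068 is in Go; the following is only a Python analogue of the same shape, with hypothetical names (MemoryVolume, handle_get, client_gone). It shows a volume whose get() fills a caller-supplied buffer and returns a byte count, and a handler that checks for a disconnected client while waiting for a buffer.

```python
import queue
import threading


class MemoryVolume:
    """Hypothetical volume: get() reads into the caller's buffer."""

    def __init__(self, blocks):
        self.blocks = blocks
        self.calls = 0

    def get(self, locator, buf):
        self.calls += 1
        data = self.blocks[locator]  # raises KeyError if not stored here
        if len(data) > len(buf):
            raise ValueError("buffer too small")
        buf[:len(data)] = data
        return len(data)  # number of bytes read, like io readinto()


def handle_get(locator, volumes, pool, client_gone, poll_interval=0.05):
    # Wait for a free buffer, but give up as soon as the client is gone,
    # before asking any volume to do work.
    while True:
        if client_gone.is_set():
            return None
        try:
            buf = pool.get(timeout=poll_interval)
            break
        except queue.Empty:
            continue
    try:
        for vol in volumes:
            try:
                n = vol.get(locator, buf)
            except KeyError:
                continue
            return bytes(buf[:n])
        raise KeyError(locator)
    finally:
        pool.put(buf)  # always return the buffer to the pool


vol = MemoryVolume({"abc": b"hello"})
pool = queue.Queue()
pool.put(bytearray(16))
found = handle_get("abc", [vol], pool, threading.Event())

# An exhausted pool plus a disconnected client: no volume is consulted.
empty_pool = queue.Queue()
gone = threading.Event()
gone.set()
calls_before = vol.calls
abandoned = handle_get("abc", [vol], empty_pool, gone)
```

Owning the buffer in the handler is what lets it abandon the request while it is still waiting, which a volume-allocated buffer would not allow.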
Tom Clegg [Mon, 2 May 2016 21:13:35 +0000 (17:13 -0400)]
Use "grep -xF ... >/dev/null" instead of "grep -qxF ..."
1. -q "Exit immediately with zero status if any match is found, even if
an error was detected." --grep(1)
Depending on buffering and timing, if grep exits early (before
consuming stdin) "docker images" can receive SIGPIPE and exit
non-zero. We use "set -o pipefail" here, so this fails the "docker
load" phase and then the whole job.
2. "Portable shell scripts should avoid both -q and -s and should
redirect standard and error output to /dev/null instead." --grep(1)
Peter Amstutz [Fri, 29 Apr 2016 13:11:15 +0000 (09:11 -0400)]
8998: Monkey patch URI.decode_www_form_component to validate efficiently.
Rack uses the standard library method URI.decode_www_form_component to process
parameters. This method first validates the string with a regular expression,
and then decodes it using another regular expression. Ruby 2.1 and earlier have
a bug in the validation; the regular expression that is used generates many
backtracking points, which results in exponential memory growth when matching
large strings. The fix is to monkey-patch the version of the method from Ruby
2.2 which checks that the string is not invalid instead of checking it is
valid.
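
The patch itself is Ruby; as a rough Python analogue of the "check for invalid rather than valid" approach, the sketch below searches for the first malformed %-escape instead of matching the whole string against a pattern of valid input. The function name decode_form_component and the exact pattern are assumptions for illustration, not the code from Ruby 2.2.

```python
import re
from urllib.parse import unquote_plus

# A '%' that is not followed by two hex digits is the only invalid construct
# this sketch looks for. The negative lookahead has nothing to backtrack
# over, so the scan is linear in the length of the input.
INVALID_ESCAPE = re.compile(r"%(?![0-9A-Fa-f]{2})")


def decode_form_component(s):
    if INVALID_ESCAPE.search(s):
        raise ValueError("invalid %-encoding")
    return unquote_plus(s)


decoded = decode_form_component("a%20b+c")

try:
    decode_form_component("a" * 1000000 + "%zz")
    rejected = False
except ValueError:
    rejected = True
```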