Brett Smith [Fri, 27 May 2016 22:30:57 +0000 (18:30 -0400)]
8959: Remove redundant python-gflags fpm-info.sh.
I added this file in 495a485ff. Later, Nico pinned the version in
run-build-packages, in a8bbf6ef, to try to fix #8959. However, odds
are that #8959 was an ops problem, and not a package building problem:
the gflags 3.0 packages were still published on our repository, and
needed to be removed there.
Having both files causes trouble when you're building backports from
scratch. We haven't noticed because Jenkins never does that. But
I'm working on new packages and getting:
Loading fpm overrides from /arvados/backports/python-gflags/fpm-info.sh
Peter Amstutz [Thu, 26 May 2016 13:51:24 +0000 (09:51 -0400)]
9303: Fetch arv_node before trying to shut down node, because monitor actor may
go away once the node has been successfully shut down. Also handle case of
node_finished_shutdown called after shutdown actor is stopped.
Peter Amstutz [Wed, 18 May 2016 20:33:57 +0000 (16:33 -0400)]
8653: cwl-runner crunch script rewrites keep file paths into CWL File objects.
Clean up argument handling in arvados-cwl-runner so that --create-template
doesn't require a job object, and that --help doesn't present options that are
irrelevant or don't work.
Peter Amstutz [Tue, 17 May 2016 20:59:20 +0000 (16:59 -0400)]
8236: Restore os.killpg(). Create a new process group so that it won't kill
the parent process by accident. Watchdog process now only monitors specific
actors.
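The process-group idea in this commit can be sketched as follows. This is a minimal illustration, not the node manager's code: it assumes Python 3 on a POSIX system, and the helper names (spawn_in_own_group, kill_group, demo) are hypothetical.

```python
import os
import signal
import subprocess
import sys

def spawn_in_own_group(argv):
    # start_new_session=True makes the child call setsid(), so it leads a
    # new session and a new process group, separate from the caller's.
    return subprocess.Popen(argv, start_new_session=True)

def kill_group(proc, sig=signal.SIGTERM):
    # Signals every process in the child's group; the caller is not in it,
    # so the caller (and its own parent) are left alone.
    os.killpg(os.getpgid(proc.pid), sig)

def demo():
    # Hypothetical usage: start a sleeping child, then terminate its group.
    child = spawn_in_own_group(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    kill_group(child)
    return child.wait()
```

Because the signal is addressed to the child's group rather than the caller's, a stray os.killpg() cannot reach the process that launched it.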
Peter Amstutz [Mon, 16 May 2016 14:29:50 +0000 (10:29 -0400)]
9161: Decisions to start and stop compute nodes are now based on an explicit
set of states: booting, unpaired, idle, busy, down, shutdown. Refactor to
remove 'shutdowns' dict and fold into cloud_nodes. Nodes_wanted uses same
computation of node state as used for decision to shut down nodes. Nodes for
which the state is unclear are either idle (if in the boot grace period) or
down (if older).
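A rough sketch of the "one state computation, shared by both decisions" idea described above. The state names come from the commit message; the function names and the shape of the inputs are assumptions made for illustration only.

```python
from collections import Counter

# State names are taken from the commit message.
STATES = ("booting", "unpaired", "idle", "busy", "down", "shutdown")

def node_state(reported_state, in_boot_grace_period):
    """Return one explicit state for a node.

    reported_state is a hypothetical input: one of STATES, or None when
    the node's state is unclear.
    """
    if reported_state in STATES:
        return reported_state
    # Unclear state: idle while still in the boot grace period, else down.
    return "idle" if in_boot_grace_period else "down"

def tally_states(nodes):
    """Count nodes per state from (reported_state, in_grace) pairs."""
    return Counter(node_state(r, g) for r, g in nodes)
```

Both the "how many nodes are wanted" calculation and the shutdown decision could then read from the same tally, so they cannot disagree about what state a node is in.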
Peter Amstutz [Fri, 13 May 2016 20:09:10 +0000 (16:09 -0400)]
9161: Add _nodemanager_recently_booted as a new way of remembering nodes which
are in an intermediate state between being created and showing up in the cloud
node list.
Brett Smith [Thu, 12 May 2016 20:40:37 +0000 (16:40 -0400)]
9213: Improve gem loading in `arv`.
* Include the exception string in the error message.
* Separate stdlib loading problems from gem loading problems.
* Load gems with more dependencies first, to avoid situations like
this:
irb(main):001:0> require 'active_support/inflector'
=> true
irb(main):002:0> require 'arvados/google_api_client'
Gem::LoadError: Unable to activate arvados-0.1.20160420143004, because activesupport-4.2.6 conflicts with activesupport (< 4.2.6, >= 3)
Brett Smith [Mon, 9 May 2016 16:54:23 +0000 (12:54 -0400)]
9135: Bring EventClient's public interface closer to PollClient's.
* Restore the run_forever method, which was previously inherited from
WebSocketClient.
* Remove the connect and close_connection methods, which are
WebSocketClient implementation details that don't make sense as part
of the public interface. (A running EventClient will just reconnect
if you call close_connection on it.)
Brett Smith [Mon, 9 May 2016 16:57:42 +0000 (12:57 -0400)]
9135: Make EventClient initialization more consistent.
* DRY up the setup code. This includes always trying to close the
connection after failure, since we were doing that in the initial
connection.
* Make the client a daemon thread, for consistency with PollClient.
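For readers unfamiliar with daemon threads, here is a minimal sketch of what the last bullet means in practice. BackgroundClient is a hypothetical stand-in, not EventClient or PollClient; only threading.Thread and threading.Event are real APIs here.

```python
import threading

class BackgroundClient(threading.Thread):
    """Hypothetical stand-in for a client that runs in the background."""

    def __init__(self):
        super().__init__()
        # A daemon thread does not keep the interpreter alive at exit.
        self.daemon = True
        self._stop_requested = threading.Event()

    def run(self):
        # Placeholder for the real work: block until asked to stop.
        self._stop_requested.wait()

    def close(self):
        # Ask the thread to finish, then wait for it.
        self._stop_requested.set()
        self.join()
```

With daemon set to True, a program that forgets to call close() can still exit when its main thread finishes.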
Peter Amstutz [Fri, 13 May 2016 14:11:39 +0000 (10:11 -0400)]
9161: Eliminate 'booted' list and put nodes directly into cloud_nodes list.
Refactor logic for registering cloud nodes. Refactor computation of nodes
wanted; explicitly model 'unpaired' and 'down'.
Peter Amstutz [Wed, 11 May 2016 20:55:00 +0000 (16:55 -0400)]
9161: There's a window between when a node pings for the first time and the
value of 'slurm_state' is synchronized by crunch-dispatch. In this window, the
node will still report as 'down'. Check first_ping_at and implement a grace
period where the node will be considered 'idle'.
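The grace-period check might look roughly like this. It is a sketch under stated assumptions: the function name, the datetime inputs, and the five-minute default are all hypothetical, and only the 'down'/'idle' state names and first_ping_at come from the commit message.

```python
from datetime import datetime, timedelta, timezone

def effective_state(slurm_state, first_ping_at, now,
                    grace=timedelta(minutes=5)):
    """Treat a recently pinged node that still reports 'down' as 'idle'.

    first_ping_at is a timezone-aware datetime, or None if the node has
    not pinged yet. The default grace length is an arbitrary example.
    """
    recently_pinged = (first_ping_at is not None
                       and now - first_ping_at < grace)
    if slurm_state == "down" and recently_pinged:
        return "idle"
    return slurm_state
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test.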