Brett Smith [Mon, 2 Mar 2015 16:27:59 +0000 (11:27 -0500)]
4751: Node Manager considers ping times for stricter node pairing.
Because the pairing decision is currently based on IP address alone,
Node Manager will occasionally pair a cloud node with the wrong
Arvados node after an IP address is reused. Fix that by bringing the
node's first_ping_at into consideration: if it's older than the cloud
node, refuse to pair.
This more closely matches the behavior of the EC2 driver, which we
want.
* Upgrade to libcloud 0.16, which adds an ex_disk_auto_delete argument
to GCE's create_node method, with True as the default.
* Set destroy_boot_disk=True when calling destroy_node().
Brett Smith [Wed, 25 Feb 2015 16:37:26 +0000 (11:37 -0500)]
5283: crunch-job doesn't use freeze logic after a job fails.
If the job has failed permanently, we want to go through all the
end-of-job logic. Previously, we were getting sidetracked into
freeze_if_want_freeze, which skips some steps like setting the
permanent job output record. Refs #4472.
Brett Smith [Fri, 27 Feb 2015 19:20:12 +0000 (14:20 -0500)]
5283: Improve reliability of crunch-job output collation.
* Check the results of all pipe opens, exit statuses, and writes.
Log any problems.
* Have fetch_block return undef when it encounters trouble, rather
than dying. create_output_collection already checks for this, so it
effectively bubbles up the error.
* Retry all of the associated API calls.
* Kill the manifest creation pipe if we give up on it, per the TODO.
This probably won't resolve #5283, but hopefully these changes will
give us additional information to help diagnose the problem.
Peter Amstutz [Wed, 25 Feb 2015 21:42:10 +0000 (16:42 -0500)]
5309: Fix keepclient and keepproxy bugs related to error handling:
* KeepClient: Handle nil response and nil response Body from Client.Do(GET)
* KeepProxy: Only defer reader.Close() if reader is not nil
* KeepProxy tests: Add test for GET failure to read (404)
Peter Amstutz [Wed, 25 Feb 2015 19:09:09 +0000 (14:09 -0500)]
5310: Use c.get('name') instead of c['name']
* 'name' isn't necessarily present when obj_uuid is a PDH,
src.collections().get(uuid=obj_uuid).execute() may return a synthetic record
without a name field.
Peter Amstutz [Mon, 23 Feb 2015 21:36:09 +0000 (16:36 -0500)]
4520: Coerce unicode strings to ascii in put(). Use result.content (returns
literal result bytes) instead result.text (returns unicode) when processing
Keep responses.
Peter Amstutz [Mon, 23 Feb 2015 20:17:19 +0000 (15:17 -0500)]
4520: manifest_text() is utf-8 encoded by default so it can be safely put() to
Keep. Add test that calling put() with a unicode string raises an error.
Fetching user uuid in arv-copy uses num_retries.
Peter Amstutz [Mon, 23 Feb 2015 19:17:12 +0000 (14:17 -0500)]
4520: Better checking to see if collection already exists at the destination.
Set args.project_uuid to default value (current user uuid) if not set on
command line to simplify code.
Peter Amstutz [Mon, 23 Feb 2015 18:54:47 +0000 (13:54 -0500)]
4520: Refactor code to create the collection record. Also refactored code
which creates Docker metadata links so that copying any collection which
represents a Docker image will also copy over the metadata.
Phil Hodgson [Sat, 21 Feb 2015 09:16:31 +0000 (10:16 +0100)]
4232: revert experimental change to using find? for each of the jobs in a pipeline, rather than simply a where clause: there is no evidence that this switch to find? was helping to speed up anything overall
Brett Smith [Mon, 16 Feb 2015 16:06:41 +0000 (11:06 -0500)]
4138: Prepare Node Manager GCE driver for production.
* Set node metadata in more appropriate places.
* Bridge more differences between GCE and EC2, like the fact that
sizes are listed for each location they're available, and GCE
doesn't provide node boot times.
* Use more infrastructure from BaseComputeNodeDriver to reduce code
duplication.
* Load as many objects as possible at initialization time, to reduce
API overhead of creating nodes.
Brett Smith [Fri, 13 Feb 2015 20:24:04 +0000 (15:24 -0500)]
4138: Revamp Node Manager driver proxying in BaseComputeNodeDriver.
Accessing attributes through a super() proxy does not invoke
__getattr__ on base classes, so the old implementation made it
impossible for subclasses to be agnostic about whether a method was
implemented in BaseComputeNodeDriver or the real libcloud driver.
This version makes that possible. It's also a little nicer because
now the class will report these method names to dir(), hasattr(), etc.
Brett Smith [Thu, 12 Feb 2015 20:53:16 +0000 (15:53 -0500)]
4138: Fix noop Node Manager EC2 driver tests.
The previous tests simply instantiated the driver, then checked that a
mock method was truthy (which it will always be). This makes the test
work as intended.
Brett Smith [Wed, 11 Feb 2015 20:12:37 +0000 (15:12 -0500)]
4138: Simplify Node Manager GCE credential handling.
Because libcloud's GCE driver accepts a key path as a constructor
argument, it's relatively straightforward to put all the constructor
arguments directly in the Node Manager configuration. No need to
parse out JSON.
Tim Pierce [Fri, 23 Jan 2015 22:24:54 +0000 (17:24 -0500)]
4138: GCE fixes
The 'network_id' parameter needs to be delivered as 'location' in GCE.
The ping_url parameter is now delivered in the node metadata as
'pingUrl'.
When creating a new GCE instance, 'name' is a required parameter and
must begin with a letter. The default name is the UUID of the
corresponding Arvados node, prepended with 'arv-'.
Tim Pierce [Wed, 21 Jan 2015 18:06:35 +0000 (13:06 -0500)]
4138: general GCE fixes
* JSON credential file
** GCE credentials are delivered as a JSON string (and the key is formatted as a multi-line RSA private key). Let the GCE config file specify a path to the JSON credential file for simplicity.
* Accept NodeSizes addressed by id or name
** In EC2, NodeSizes are identified by the 'id' field. In GCE they are identified by the 'name' field. Allow the Node Manager config module to accept either.