crunch-dispatch needs $HOME set in order to do some path
manipulation. We set it to $(pwd) in production, which should
normally be the runit service definition directory.
This seems odd, and there is likely a better value, but until we
investigate and find one, documenting what we have in production is
better than letting crunch-dispatch crash.
Peter Amstutz [Thu, 19 Nov 2015 19:44:42 +0000 (14:44 -0500)]
5353: Remove checks that cloud_node.size is None (because it should never be None or
booting multiple node sizes won't work). Set size explicitly for the dummy driver.
radhika [Thu, 19 Nov 2015 17:44:15 +0000 (12:44 -0500)]
7490: Add Err to collection.ReadCollections and keep.ServerResponse so that the error can be propagated to clients accessing these through a channel read.
Brett Smith [Mon, 9 Nov 2015 15:13:21 +0000 (10:13 -0500)]
6923: Improve Arvados SDK version logging in Crunch run script.
* Use a mechanism that works in a wider variety of containers. This
only depends on Python itself and setuptools. It won't generate
spurious warnings by calling dpkg-query on Red Hat containers.
* Don't log the version when we successfully set up the specified
arvados_sdk_version. The version will only be '0.1' in this case,
and that's not helpful.
sguthrie [Tue, 10 Nov 2015 20:23:18 +0000 (15:23 -0500)]
Closes #7235. Instead of setting KeepService's pycurl.TIMEOUT_MS, set pycurl.LOW_SPEED_LIMIT and pycurl.LOW_SPEED_TIME.
Default LOW_SPEED_LIMIT is 32768 bytes per second. Default LOW_SPEED_TIME is 64 seconds.
If the user specifies a length-two tuple, the first item sets CONNECTTIMEOUT_MS, the second item sets LOW_SPEED_TIME,
and LOW_SPEED_LIMIT is set to 32768 bytes per second.
Added bandwidth simulator to keepstub, which uses millisecond precision (like curl) to measure timeouts.
Added tests to test_keep_client and modified existing tests to only use integers.
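The timeout mapping described above can be sketched roughly as follows. This is an illustration only, not the SDK's code: `curl_timeout_options` is a hypothetical helper that returns a plain dict keyed by libcurl option name rather than calling pycurl, and it assumes the tuple items are given in seconds.

```python
import math

DEFAULT_LOW_SPEED_LIMIT = 32768  # bytes per second, per the commit message
DEFAULT_LOW_SPEED_TIME = 64      # seconds, per the commit message


def curl_timeout_options(timeout=None):
    """Map a timeout argument to libcurl option names and values.

    Hypothetical helper: returns a dict instead of calling pycurl's
    setopt(), so it runs without pycurl installed.
    """
    options = {
        "LOW_SPEED_LIMIT": DEFAULT_LOW_SPEED_LIMIT,
        "LOW_SPEED_TIME": DEFAULT_LOW_SPEED_TIME,
    }
    if timeout is None:
        return options
    connect_seconds, low_speed_seconds = timeout  # length-two tuple
    # Assumption: tuple items are seconds; CONNECTTIMEOUT_MS takes milliseconds.
    options["CONNECTTIMEOUT_MS"] = int(connect_seconds * 1000)
    # libcurl's LOW_SPEED_TIME is in whole seconds, so round up.
    options["LOW_SPEED_TIME"] = int(math.ceil(low_speed_seconds))
    return options
```

For example, `curl_timeout_options((2, 10))` yields a 2000 ms connection timeout and aborts a transfer that stays below 32768 bytes per second for 10 seconds.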
Brett Smith [Wed, 11 Nov 2015 22:08:39 +0000 (17:08 -0500)]
7696: Improve PySDK KeepClient.ThreadLimiter.
* Move the calculation of how many threads to allow into the class.
* Teach it to handle cases where max_replicas_per_service is known and
greater than 1. This will never happen today, but is an anticipated
improvement.
* Update docstrings to reflect current reality.
These are all changes I made while debugging the previous race
condition.
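One plausible form of the thread-count calculation, as a minimal sketch: `allowed_threads` and its parameter names are hypothetical, and the rule shown (one thread when the per-service replica count is unknown or already covers the request, otherwise enough threads to reach the wanted copies) is an assumption about the intent, not a copy of the class.

```python
import math


def allowed_threads(copies_wanted, max_replicas_per_service=None):
    """Return how many writer threads may run at once.

    Assumption: when the per-service replica count is unknown, or one
    service can satisfy the whole request, a single thread avoids writing
    more copies than requested.
    """
    if not max_replicas_per_service or max_replicas_per_service >= copies_wanted:
        return 1
    # Otherwise allow just enough concurrent requests to reach copies_wanted.
    return int(math.ceil(copies_wanted / float(max_replicas_per_service)))
```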
Brett Smith [Wed, 11 Nov 2015 21:50:18 +0000 (16:50 -0500)]
7696: PySDK determines max_replicas_per_service after querying services.
Because max_replicas_per_service was set to 1 in the case where
KeepClient was instantiated with no direct information about available
Keep services, and because ThreadLimiter was being instantiated before
querying available Keep services (via map_new_services), the first
Keep request to talk to non-disk services would let multiple threads
run at once. This fixes that race condition, and adds a test that was
triggering it semi-reliably.
Brett Smith [Mon, 9 Nov 2015 15:28:51 +0000 (10:28 -0500)]
7123: Crunch doesn't update job log when arv-put fails.
This prevents crunch-job from recording the empty collection as a
job's log. Most other components (Workbench, the log cleaner)
recognize a null log as a special case; less so the empty collection.
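The guard can be illustrated with a small sketch. crunch-job itself is written in Perl; the Python below only models the decision, and `updated_log` and its parameters are hypothetical names.

```python
def updated_log(current_log, put_exit_code, put_output):
    """Return the log value to record after an upload attempt.

    Hypothetical model: put_exit_code and put_output stand in for the
    exit status and printed locator of an arv-put invocation.
    """
    locator = put_output.strip()
    if put_exit_code != 0 or not locator:
        # Upload failed: keep the existing value (possibly None) rather
        # than recording an empty collection as the log.
        return current_log
    return locator
```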
Brett Smith [Mon, 9 Nov 2015 13:30:14 +0000 (08:30 -0500)]
6356: crunch-job doesn't create new tasks after job success is set.
#6356 reported that a permanently failed task was retried. Note 3
discusses why this happened and suggests two fixes:
* Only put tempfailed tasks back on the todo list.
* Run `last THISROUND if $main::please_freeze || defined($main::success);`
after we call reapchildren(), since it's the main place where the
value of $main::success can change.
The first change would revert part of 75be7487c2bbd83aa5116aa5f8ade5ddf31501da, which intentionally puts
these tasks back on the todo list to get a correct task count.
The current `last if…` line was added in b306eb48ab12676ffb365ede8197e4f2d7e92011, with the rationale "Don't
create new tasks if $main::success is defined." This change corrects
the code to implement the desired functionality, by checking and
stopping just before we create a new task (functionally, at least).
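The control flow this describes can be modeled in a few lines. This is a simplified Python sketch, not the Perl in crunch-job: `run_round`, `run_task`, and `please_freeze` are hypothetical stand-ins, and each task is run to completion in place of starting a child and reaping it later.

```python
def run_round(todo, run_task, please_freeze=lambda: False):
    """Start tasks from todo until it is empty or the job is decided.

    Hypothetical model: run_task(task) runs one task and returns the
    job's success value afterwards (None while still undecided).
    """
    started = []
    success = None
    while todo:
        # Check just before creating a new task, as the commit describes.
        if please_freeze() or success is not None:
            break
        task = todo.pop(0)
        started.append(task)
        success = run_task(task)
    return started, success
```

In this model, once a task sets the job's success to a defined value (true or false), any tasks still on the todo list are left unstarted.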