Brett Smith [Wed, 11 Nov 2015 22:08:39 +0000 (17:08 -0500)]
7696: Improve PySDK KeepClient.ThreadLimiter.
* Move the calculation of how many threads to allow into the class.
* Teach it to handle cases where max_replicas_per_service is known and
greater than 1. This will never happen today, but is an anticipated
improvement.
* Update docstrings to reflect current reality.
These are all changes I made while debugging the previous race
condition.
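The commit message doesn't include the code itself; as a rough Python sketch of the idea, assuming a semaphore-based limiter (the constructor arguments and the ceiling-division rule are illustrative assumptions, not the SDK's actual signature):

```python
import threading

class ThreadLimiter(object):
    """Sketch: limit how many KeepClient writer threads run at once.

    If each Keep service can store more than one copy of a block
    (max_replicas_per_service > 1), fewer concurrent threads are needed
    to reach the desired replication level.
    """
    def __init__(self, want_copies, max_replicas_per_service=None):
        if max_replicas_per_service is None or max_replicas_per_service < 1:
            # Without better information, assume one replica per service,
            # so one thread per desired copy.
            threads = want_copies
        else:
            # Ceiling division: e.g. 3 copies at 2 replicas/service -> 2 threads.
            threads = -(-want_copies // max_replicas_per_service)
        self._sem = threading.Semaphore(max(threads, 1))

    def __enter__(self):
        self._sem.acquire()
        return self

    def __exit__(self, exc_type, exc_val, exc_tb):
        self._sem.release()
```

With max_replicas_per_service left unknown, this behaves as before: one writer thread per desired copy.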
Brett Smith [Wed, 11 Nov 2015 21:50:18 +0000 (16:50 -0500)]
7696: PySDK determines max_replicas_per_service after querying services.
Because max_replicas_per_service was set to 1 in the case where
KeepClient was instantiated with no direct information about available
Keep services, and because ThreadLimiter was being instantiated before
querying available Keep services (via map_new_services), the first
Keep request to talk to non-disk services would let multiple threads
run at once. This fixes that race condition, and adds a test that was
triggering it semi-reliably.
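Building on the ThreadLimiter sketch above, a simplified illustration of the reordering (the service dictionaries, the 'disk' service-type check, and choose_max_replicas are assumed names for illustration, not necessarily how the SDK does it):

```python
def choose_max_replicas(services, want_copies):
    """Sketch of the fix's key decision: only after the Keep service list
    is known can we tell whether a non-disk service (e.g. a proxy) will
    store every requested copy itself."""
    if any(svc.get('service_type') != 'disk' for svc in services):
        return want_copies
    return 1

# Illustrative ordering: query services *before* building the limiter,
# so it is sized with real information instead of a default of 1.
services = [{'service_type': 'proxy'}]   # pretend result of map_new_services
want_copies = 2
limiter = ThreadLimiter(want_copies, choose_max_replicas(services, want_copies))
```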
Brett Smith [Mon, 9 Nov 2015 15:28:51 +0000 (10:28 -0500)]
7123: Crunch doesn't update job log when arv-put fails.
This prevents crunch-job from recording the empty collection as a
job's log. Most other components (Workbench, the log cleaner)
recognize a null log as a special case, but not the empty collection.
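crunch-job itself is Perl; the hedged Python sketch below only illustrates the intended behavior (save_job_log and the API call shape are assumptions for illustration):

```python
import subprocess

def save_job_log(api, job_uuid, log_path):
    """Record a log collection only if arv-put actually succeeded;
    otherwise leave the job's log null rather than pointing it at an
    empty collection."""
    try:
        pdh = subprocess.check_output(
            ['arv-put', '--portable-data-hash', log_path]).decode().strip()
    except subprocess.CalledProcessError:
        return  # arv-put failed: keep job.log null
    api.jobs().update(uuid=job_uuid, body={'log': pdh}).execute()
```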
Brett Smith [Mon, 9 Nov 2015 13:30:14 +0000 (08:30 -0500)]
6356: crunch-job doesn't create new tasks after job success is set.
#6356 reported that a permanently failed task was retried. Note 3
discusses why this happened and suggests two fixes:
* Only put tempfailed tasks back on the todo list.
* Run `last THISROUND if $main::please_freeze || defined($main::success);`
after we call reapchildren(), since it's the main place where the
value of $main::success can change.
The first change would revert part of 75be7487c2bbd83aa5116aa5f8ade5ddf31501da, which intentionally puts
these tasks back on the todo list to get a correct task count.
The current `last if…` line was added in b306eb48ab12676ffb365ede8197e4f2d7e92011, with the rationale "Don't
create new tasks if $main::success is defined." This change makes the
code implement that intent by checking, and stopping, just before we
create a new task (functionally, at least).
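For illustration only, a Python rendering of the control-flow change (crunch-job is Perl; every name below is a stand-in):

```python
job_success = None        # stand-in for $main::success
please_freeze = False     # stand-in for $main::please_freeze
todo = ['task-1', 'task-2', 'task-3']

def reap_children():
    # In crunch-job, reaping finished children is the main place where
    # $main::success can become defined (e.g. on permanent task failure).
    pass

while todo:
    reap_children()
    # Equivalent of running `last THISROUND if $main::please_freeze ||
    # defined($main::success);` just before creating a new task.
    if please_freeze or job_success is not None:
        break
    print('creating task', todo.pop(0))
```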
Tom Clegg [Sat, 7 Nov 2015 09:36:01 +0000 (04:36 -0500)]
5824: Fix disposition=attachment handling.
Propagate disposition=attachment from Workbench to keep-web when
redirecting.
Include a filename in the Content-Disposition header if the request
URL contains "?", so UAs don't mistakenly include the query string as
part of the default filename.
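keep-web is written in Go; the following is only a hedged Python sketch of the header rule described above (the function name and exact quoting are illustrative):

```python
import posixpath

def content_disposition(request_url, attachment=False):
    """If the request URL contains "?", spell out the filename so user
    agents don't fold the query string into the default download name."""
    header = 'attachment' if attachment else 'inline'
    if '?' in request_url:
        path = request_url.split('?', 1)[0]
        header += '; filename="%s"' % posixpath.basename(path)
    return header

# content_disposition('https://keep.example/c=abc/foo.txt?disposition=attachment',
#                     attachment=True)  ->  'attachment; filename="foo.txt"'
```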
radhika [Fri, 6 Nov 2015 03:17:59 +0000 (22:17 -0500)]
5538: Merge FailHandler and FailThenSucceedHandler into one APIStub to facilitate testing many more error states; also add update and delete retry tests.
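As an illustration of the testing pattern (the real APIStub lives in the Go SDK tests; everything below is a hypothetical Python sketch), a single stub that fails a configurable number of times before succeeding covers both the always-fail and fail-then-succeed cases:

```python
class APIStub(object):
    def __init__(self, responses):
        # responses: list of (status_code, body) returned in order;
        # the last entry repeats once the list is exhausted.
        self.responses = list(responses)
        self.calls = 0

    def handle(self, request):
        index = min(self.calls, len(self.responses) - 1)
        self.calls += 1
        return self.responses[index]

# "Fail then succeed": two 500s, then a 200 -- useful for retry tests.
stub = APIStub([(500, 'error'), (500, 'error'), (200, 'ok')])
```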
radhika [Thu, 5 Nov 2015 14:39:04 +0000 (09:39 -0500)]
7490: The makeArvadosClient func, which is invoked by singlerun, should return an error, not call fatalf.
The main method expects an error in all error cases and decides the next action; when makeArvadosClient is given a wait time, it will retry.
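The change itself is in Go; as a loose Python sketch of the design idea (all names here are hypothetical):

```python
import time

class APIConnectionError(Exception):
    pass

def make_arvados_client():
    # Sketch: report failure to the caller instead of terminating the
    # whole process (the change replaces a fatal exit with a returned error).
    raise APIConnectionError('could not reach the API server')

def main(retry_wait=None):
    while True:
        try:
            return make_arvados_client()
        except APIConnectionError:
            if not retry_wait:
                raise               # no wait time configured: give up
            time.sleep(retry_wait)  # wait time provided: retry
```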
Brett Smith [Wed, 4 Nov 2015 17:20:36 +0000 (12:20 -0500)]
7713: Node Manager blackholes broken nodes that can't shut down.
We are seeing situations on Azure where some nodes in an UNKNOWN state
cannot be shut down. The API call to destroy them always fails.
There are two related halves to this commit. In the first half,
after a cloud shutdown request fails, ComputeNodeShutdownActor checks
whether the node is broken. If it is, it cancels shutdown retries.
In the second half, the daemon checks for this shutdown outcome. When
it happens, it blacklists the broken node: it will immediately filter
it out of node lists from the cloud. It is no longer monitored in any
way or counted as a live node, so Node Manager will boot a replacement
for it.
This lets Node Manager create cloud nodes above max_nodes, up to the
number of broken nodes. We're reasonably bounded for now because
only the Azure driver will ever declare a node broken. Other clouds
will never blacklist nodes this way.
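A rough, hypothetical Python sketch of the two halves described above (the real Node Manager uses pykka actors; the names and shapes below are assumptions, not the actual classes):

```python
class ShutdownOutcome(object):
    def __init__(self, node_id, success, node_broken):
        self.node_id = node_id
        self.success = success
        self.node_broken = node_broken

def should_cancel_retries(shutdown_failed, node_is_broken):
    # First half: after a failed cloud shutdown call, stop retrying if
    # the driver reports the node as broken.
    return shutdown_failed and node_is_broken

class DaemonSketch(object):
    def __init__(self):
        self.blacklist = set()

    def record_shutdown(self, outcome):
        # Second half: a broken node whose shutdown was abandoned is
        # blacklisted, so it stops counting as a live node and Node
        # Manager may boot a replacement for it.
        if (not outcome.success) and outcome.node_broken:
            self.blacklist.add(outcome.node_id)

    def filter_cloud_nodes(self, cloud_node_ids):
        # Blacklisted nodes are dropped from cloud listings immediately.
        return [n for n in cloud_node_ids if n not in self.blacklist]
```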