arvados.git
Peter Amstutz [Wed, 18 Nov 2015 17:09:47 +0000 (12:09 -0500)]
Merge branch '5353-node-sizes' closes #5353

Brett Smith [Wed, 18 Nov 2015 16:36:04 +0000 (11:36 -0500)]
6846: Streamline Workbench 404 page.

* Prompt the user to log in with a prominent button.
* Make the page text less verbose.
* DRY up the code in the _report_error partial.

Refs #6846.

Peter Amstutz [Wed, 18 Nov 2015 15:18:03 +0000 (10:18 -0500)]
5353: Remove extra assertion because busywait does it for us.

Peter Amstutz [Wed, 18 Nov 2015 14:52:46 +0000 (09:52 -0500)]
5353: Update comment about min_nodes and node size.

Peter Amstutz [Wed, 18 Nov 2015 14:45:08 +0000 (09:45 -0500)]
5353: Add a couple comments to tests.

Peter Amstutz [Wed, 18 Nov 2015 14:25:32 +0000 (09:25 -0500)]
5353: Fix typo in _nodes_wanted().  Calculate number of nodes that can boot
based on price cap.  Don't add jobs to wishlist that exceed max price cap.

sguthrie [Tue, 17 Nov 2015 21:49:04 +0000 (16:49 -0500)]
Closes #7235. Merge branch '7235-python-keep-client-timeout'

sguthrie [Tue, 10 Nov 2015 20:23:18 +0000 (15:23 -0500)]
Closes #7235. Instead of setting KeepService's pycurl.TIMEOUT_MS, set pycurl.LOW_SPEED_LIMIT and pycurl.LOW_SPEED_TIME.
Default LOW_SPEED_LIMIT is 32768 bytes per second. Default LOW_SPEED_TIME is 64 seconds.
If the user specifies a length-two tuple, the first item sets CONNECTTIMEOUT_MS, the second item sets LOW_SPEED_TIME,
and LOW_SPEED_LIMIT is set to 32768 bytes per second.

Added bandwidth simulator to keepstub, which uses millisecond precision (like curl) to measure timeouts.
Added tests to test_keep_client and modified existing tests to only use integers.
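
The timeout mapping described in this commit can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: the function name `curl_timeout_options` and the plain-dict return value are hypothetical, and it assumes the tuple items are given in seconds. Only the option names and the defaults (32768 bytes per second, 64 seconds) come from the commit message.

```python
# Defaults quoted in the commit message.
DEFAULT_LOW_SPEED_LIMIT = 32768  # bytes per second
DEFAULT_LOW_SPEED_TIME = 64      # seconds


def curl_timeout_options(timeout=None):
    """Map a timeout argument to curl option names and values.

    Hypothetical helper: `timeout` is either None (use the defaults) or a
    length-two tuple (connect_seconds, low_speed_seconds).
    """
    opts = {
        "LOW_SPEED_LIMIT": DEFAULT_LOW_SPEED_LIMIT,
        "LOW_SPEED_TIME": DEFAULT_LOW_SPEED_TIME,
    }
    if timeout is not None:
        connect_seconds, low_speed_seconds = timeout
        # Assumption: the caller passes seconds; CONNECTTIMEOUT_MS wants ms.
        opts["CONNECTTIMEOUT_MS"] = int(connect_seconds * 1000)
        opts["LOW_SPEED_TIME"] = int(low_speed_seconds)
    return opts
```

With a tuple such as `(2, 20)`, the sketch leaves LOW_SPEED_LIMIT at its default and only changes the connect timeout and the low-speed window.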

Brett Smith [Tue, 17 Nov 2015 03:42:31 +0000 (22:42 -0500)]
7313: crunch-job reports an error when a task doesn't record state.

Closes #7313.

Peter Amstutz [Mon, 16 Nov 2015 22:01:05 +0000 (17:01 -0500)]
5353: Fixes from testing with Dummy driver.

Peter Amstutz [Mon, 16 Nov 2015 21:25:48 +0000 (16:25 -0500)]
5353: Add note that min_nodes boots cheapest nodes.

Peter Amstutz [Mon, 16 Nov 2015 21:21:34 +0000 (16:21 -0500)]
5353: Added max_total_price.  Added more tests for multiple node sizes.
Updated config file examples.

Brett Smith [Fri, 13 Nov 2015 14:29:40 +0000 (09:29 -0500)]
Merge branch '7696-pysdk-all-keep-service-types-wip'

Closes #7696, #7758.

Brett Smith [Wed, 11 Nov 2015 22:08:39 +0000 (17:08 -0500)]
7696: Improve PySDK KeepClient.ThreadLimiter.

* Move the calculation of how many threads to allow into the class.
* Teach it to handle cases where max_replicas_per_service is known and
  greater than 1.  This will never happen today, but is an anticipated
  improvement.
* Update docstrings to reflect current reality.

These are all changes I made while debugging the previous race
condition.

Brett Smith [Wed, 11 Nov 2015 21:50:18 +0000 (16:50 -0500)]
7696: PySDK determines max_replicas_per_service after querying services.

Because max_replicas_per_service was set to 1 in the case where
KeepClient was instantiated with no direct information about available
Keep services, and because ThreadLimiter was being instantiated before
querying available Keep services (via map_new_services), the first
Keep request to talk to non-disk services would let multiple threads
run at once.  This fixes that race condition, and adds a test that was
triggering it semi-reliably.

Brett Smith [Wed, 11 Nov 2015 17:17:46 +0000 (12:17 -0500)]
7696: PySDK KeepClient uses all service types.

Filter out gateway services from the list of usable services, rather
than selecting only disk and proxy types.
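
The difference between an allow-list and a deny-list can be sketched as below. This is an illustration only: the function name, the `service_type` key, and the rule that gateway types are recognised by a "gateway" prefix are assumptions, not taken from the SDK.

```python
def usable_keep_services(services):
    """Keep every service whose type is not a gateway type.

    Hypothetical helper: filtering gateways out (rather than selecting only
    "disk" and "proxy") means service types added later stay usable.
    """
    return [
        svc for svc in services
        if not svc.get("service_type", "").startswith("gateway")
    ]


example_services = [
    {"uuid": "svc-disk", "service_type": "disk"},
    {"uuid": "svc-proxy", "service_type": "proxy"},
    {"uuid": "svc-blob", "service_type": "blob"},       # made-up new type
    {"uuid": "svc-gw", "service_type": "gateway:test"},
]
usable = [svc["uuid"] for svc in usable_keep_services(example_services)]
```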

Brett Smith [Wed, 11 Nov 2015 17:18:46 +0000 (12:18 -0500)]
7696: Clean imports in PySDK arvados.keep module.

Brett Smith [Wed, 11 Nov 2015 15:06:51 +0000 (10:06 -0500)]
7696: Refactor locator builder method in PySDK tests.

Brett Smith [Fri, 13 Nov 2015 14:28:12 +0000 (09:28 -0500)]
Merge branch '7123-crunch-no-record-log-failure-wip'

Closes #7123, #7741.

Brett Smith [Mon, 9 Nov 2015 15:28:51 +0000 (10:28 -0500)]
7123: Crunch doesn't update job log when arv-put fails.

This prevents crunch-job from recording the empty collection as a
job's log.  Most other components (Workbench, the log cleaner)
recognize a null log as a special case; less so the empty collection.

Brett Smith [Thu, 12 Nov 2015 21:33:48 +0000 (16:33 -0500)]
Merge branch '7645-doc-client-max-body-size-wip'

Closes #7645, #7742.  Refs #7356.

Brett Smith [Mon, 9 Nov 2015 17:44:38 +0000 (12:44 -0500)]
7356: Install guide sets client_max_body_size for arv-git-httpd.

Brett Smith [Mon, 9 Nov 2015 17:43:58 +0000 (12:43 -0500)]
7645: Install guide suggests setting client_max_body_size consistently.

Without these changes, the upstream Passenger processes may reject
large request bodies.

Brett Smith [Thu, 12 Nov 2015 21:12:28 +0000 (16:12 -0500)]
Merge branch '6846-workbench-top-nav-login-returns-wip'

Closes #6846, #7739.

Brett Smith [Mon, 9 Nov 2015 17:02:25 +0000 (12:02 -0500)]
6846: Workbench navigation bar login returns user to the same page.

Brett Smith [Thu, 12 Nov 2015 20:31:09 +0000 (15:31 -0500)]
Merge branch '6356-crunch-permfail-task-retry-fix-wip'

Closes #6356, #7738.

Brett Smith [Mon, 9 Nov 2015 13:30:14 +0000 (08:30 -0500)]
6356: crunch-job doesn't create new tasks after job success is set.

#6356 reported that a permanently failed task was retried.  Note 3
discusses why this happened and suggests two fixes:

* Only put tempfailed task back on the todo list.
* Run `last THISROUND if $main::please_freeze || defined($main::success);`
  after we call reapchildren(), since it's the main place where the
  value of $main::success can change.

The first change would revert part of
75be7487c2bbd83aa5116aa5f8ade5ddf31501da, which intentionally puts
these tasks back on the todo list to get a correct tasks count.

The current `last if…` line was added in
b306eb48ab12676ffb365ede8197e4f2d7e92011, with the rationale "Don't
create new tasks if $main::success is defined."  This change corrects
the code to implement the desired functionality, by checking and
stopping just before we create a new task (functionally, at least).

Tom Clegg [Thu, 12 Nov 2015 20:00:59 +0000 (15:00 -0500)]
Merge branch '5824-keep-web-workbench' closes #5824

Tom Clegg [Wed, 11 Nov 2015 23:32:50 +0000 (18:32 -0500)]
5824: Fix clear-download-dir helper.

Tom Clegg [Wed, 11 Nov 2015 23:32:23 +0000 (18:32 -0500)]
5824: Fix path and query escapes.

Paths encode spaces as "%20", not "+".

Rails to_query helper does undesirable things like
"disposition[]=attachment".
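
The two escaping rules can be illustrated with Python's standard library (the actual fix is in Workbench's Ruby code, so this is only an analogy). The helper name `keep_web_url` and the example host are hypothetical.

```python
from urllib.parse import quote, urlencode


def keep_web_url(base, path, params):
    """Build a URL with a percent-encoded path and a flat query string."""
    # quote() encodes a space in the path as "%20" and leaves "/" alone.
    url = base.rstrip("/") + "/" + quote(path.lstrip("/"))
    # urlencode() emits plain key=value pairs, with no "[]" suffix on keys.
    query = urlencode(params)
    return url + ("?" + query if query else "")


example_url = keep_web_url(
    "https://dl.example.com",
    "docs/my file.txt",
    {"disposition": "attachment"},
)
```

Note that `urlencode` would encode a space inside a query value as "+", which is valid there; the "%20" rule applies to the path component.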

Tom Clegg [Wed, 11 Nov 2015 23:29:39 +0000 (18:29 -0500)]
5824: Fix -attachment-only-host test config. Test more preview/download variants.

Tom Clegg [Wed, 11 Nov 2015 17:14:16 +0000 (12:14 -0500)]
Merge branch '5824-keep-web-workbench' refs #5824

Tom Clegg [Wed, 11 Nov 2015 17:11:46 +0000 (12:11 -0500)]
5824: Merge branch 'master' into 5824-keep-web-workbench

Conflicts:
services/keepproxy/keepproxy_test.go

radhika [Wed, 11 Nov 2015 16:01:24 +0000 (11:01 -0500)]
closes #7661
Merge branch '7661-fuse-by-pdh'

radhika [Wed, 11 Nov 2015 16:01:02 +0000 (11:01 -0500)]
Merge branch 'master' into 7661-fuse-by-pdh

Tom Clegg [Wed, 11 Nov 2015 01:48:24 +0000 (20:48 -0500)]
5824: Update/clarify docs and comments.

radhika [Tue, 10 Nov 2015 23:41:55 +0000 (18:41 -0500)]
7661: Pass pdh_only when adding by_id subdir; test now passes.

Tom Clegg [Tue, 10 Nov 2015 16:35:03 +0000 (11:35 -0500)]
Merge branch '5538-test-post-retry' refs #5538

Tom Clegg [Tue, 10 Nov 2015 16:33:32 +0000 (11:33 -0500)]
5538: Update comments to match new tests.

radhika [Tue, 10 Nov 2015 15:52:35 +0000 (10:52 -0500)]
7661: added test with only_pdh (not working yet)

Tom Clegg [Tue, 10 Nov 2015 15:10:55 +0000 (10:10 -0500)]
5538: Test that POST method is not retried.

Tom Clegg [Tue, 10 Nov 2015 07:20:34 +0000 (02:20 -0500)]
Use a different port number for each test case. No issue #

Tom Clegg [Tue, 10 Nov 2015 06:29:11 +0000 (01:29 -0500)]
5824: Support configuration with a download-only host.

radhika [Mon, 9 Nov 2015 20:41:46 +0000 (15:41 -0500)]
Merge branch 'master' into 7661-fuse-by-pdh

Tom Clegg [Mon, 9 Nov 2015 20:00:14 +0000 (15:00 -0500)]
5824: Preserve query in keep_web_url template. Warn when redirecting preview to a single-origin keep_web_url.

Peter Amstutz [Mon, 9 Nov 2015 19:33:09 +0000 (14:33 -0500)]
Merge branch '3585-arpi-project-uuid-wip' closes #3585

radhika [Mon, 9 Nov 2015 19:01:17 +0000 (14:01 -0500)]
Merge branch 'master' into 7661-fuse-by-pdh

radhika [Mon, 9 Nov 2015 18:54:29 +0000 (13:54 -0500)]
closes #5538
Merge branch '5538-arvadosclient-retry'

radhika [Mon, 9 Nov 2015 18:49:31 +0000 (13:49 -0500)]
5538: update the test case for "error" to use better stub parameters with nil status codes and response body to avoid any confusion to the reader.

radhika [Mon, 9 Nov 2015 16:21:35 +0000 (11:21 -0500)]
7661: rename MagicDirectory by_pdh as pdh_only

radhika [Mon, 9 Nov 2015 15:43:13 +0000 (10:43 -0500)]
Merge branch 'master' into 7661-fuse-by-pdh

Peter Amstutz [Mon, 9 Nov 2015 14:27:38 +0000 (09:27 -0500)]
5353: Add a couple of tests to explicitly create nodes of different sizes

radhika [Mon, 9 Nov 2015 13:38:29 +0000 (08:38 -0500)]
5538: add a test that simulates error during requesting server so that we can test the error path as well.

Brett Smith [Mon, 9 Nov 2015 11:05:28 +0000 (06:05 -0500)]
3585: Add --project-uuid switch to a-r-p-i.

Tom Clegg [Mon, 9 Nov 2015 08:28:50 +0000 (03:28 -0500)]
5824: Add anonymous-404 and download-by-pdh tests.

Tom Clegg [Sun, 8 Nov 2015 20:52:29 +0000 (15:52 -0500)]
5824: Propagate non-token parts of query string (notably ?disposition=attachment) when redirecting.

Tom Clegg [Sun, 8 Nov 2015 11:39:05 +0000 (06:39 -0500)]
5824: Support partial content with Range header (only if start==0).
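
A restriction like "only if start==0" can be sketched as a small Range parser. This is a hypothetical Python illustration, not keep-web's code: the function name and the choice to return None for anything unsupported (so the caller would fall back to a full response) are assumptions.

```python
import re


def range_from_start(range_header, size):
    """Return (0, end) for a 'bytes=0-N' or 'bytes=0-' range, else None.

    Hypothetical helper: ranges that do not start at byte 0, multi-range
    requests, and malformed headers are all treated as unsupported.
    """
    match = re.fullmatch(r"bytes=(\d+)-(\d*)", range_header.strip())
    if match is None or size <= 0:
        return None
    if int(match.group(1)) != 0:
        return None  # only ranges beginning at the first byte are handled
    end = int(match.group(2)) if match.group(2) else size - 1
    return (0, min(end, size - 1))
```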

Tom Clegg [Sat, 7 Nov 2015 09:36:01 +0000 (04:36 -0500)]
5824: Fix disposition=attachment handling.

Propagate disposition=attachment from Workbench to keep-web when
redirecting.

Include a filename in the Content-Disposition header if the request
URL contains "?", so UAs don't mistakenly include the query string as
part of the default filename.
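
The filename rule can be sketched as follows. This is a hedged Python illustration rather than keep-web's implementation: the function name, the `attachment` flag, and the simple quoting of the filename are all assumptions.

```python
import posixpath
from urllib.parse import unquote, urlsplit


def content_disposition(request_url, attachment=False):
    """Build a Content-Disposition value for a response.

    Hypothetical helper: a filename is added only when the URL contains "?",
    so a user agent does not fold the query string into its default name.
    """
    kind = "attachment" if attachment else "inline"
    if "?" not in request_url:
        return kind
    # Take the last path segment, ignoring the query string entirely.
    filename = posixpath.basename(unquote(urlsplit(request_url).path))
    escaped = filename.replace("\\", "\\\\").replace('"', '\\"')
    return '{}; filename="{}"'.format(kind, escaped)
```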

Tom Clegg [Sat, 7 Nov 2015 09:06:47 +0000 (04:06 -0500)]
5824: Fixup new keepproxy tests to use simplified test setup.

See 813d35123538b00ab70719e247b6bb0881269460

Tom Clegg [Sat, 7 Nov 2015 09:03:27 +0000 (04:03 -0500)]
5824: Move "periodically refresh Keep services" func from keepproxy to SDK.

Tom Clegg [Sat, 7 Nov 2015 09:00:50 +0000 (04:00 -0500)]
5824: Fix server shutdown code.

* Pay attention to --num-keep-servers in stop_keep.

* Wait for processes to exit, to avoid start/stop races.

* Tighten exception handling in kill_server_pid() and warn instead of
  crashing in various races.

* Log TERM signals.

* Log when a server does not shut down within the given deadline.

Tom Clegg [Sat, 7 Nov 2015 08:54:03 +0000 (03:54 -0500)]
5824: Fix Keep server shutdown, check errors, simplify stderr redirection.

(Oops, we forgot to actually Run() the python command for stop_keep.)

radhika [Sat, 7 Nov 2015 14:25:48 +0000 (09:25 -0500)]
5538: update the test to set resp.body with the given string from stub rather than hard-coding it (overlooked in previous commit)

radhika [Sat, 7 Nov 2015 14:00:49 +0000 (09:00 -0500)]
5538: correct retryable list and use it to determine whether to close idle connections; add a few more test cases.

radhika [Sat, 7 Nov 2015 13:42:38 +0000 (08:42 -0500)]
Merge branch 'master' into 5538-arvadosclient-retry

Tom Clegg [Sat, 7 Nov 2015 07:22:07 +0000 (02:22 -0500)]
5824: Use fifo2stderr for arv-git-httpd and keep-web logs, too.

Tom Clegg [Fri, 6 Nov 2015 21:58:32 +0000 (16:58 -0500)]
5824: Sync test suite to new keep-web argument names.

Tom Clegg [Fri, 6 Nov 2015 21:53:01 +0000 (16:53 -0500)]
5824: Merge branch 'master' into 5824-keep-web-workbench

Peter Amstutz [Fri, 6 Nov 2015 21:24:55 +0000 (16:24 -0500)]
5353: Existing tests pass now.  (Still need to add a few tests that explicitly
test multiple node sizes.)

Peter Amstutz [Fri, 6 Nov 2015 17:02:03 +0000 (12:02 -0500)]
5353: Parameterize the following methods on node size: _nodes_up, _nodes_busy,
_nodes_missing, _nodes_wanted, _nodes_excess, start_node, and stop_booting_node.

Start fixing tests.

radhika [Fri, 6 Nov 2015 15:11:27 +0000 (10:11 -0500)]
5538: much simpler and neater api stub test case array; golint

Tom Clegg [Fri, 6 Nov 2015 04:17:32 +0000 (23:17 -0500)]
7724: Use datamanager token in keep-rsync tests. refs #7724

radhika [Fri, 6 Nov 2015 03:19:54 +0000 (22:19 -0500)]
Merge branch 'master' into 5538-arvadosclient-retry

radhika [Fri, 6 Nov 2015 03:17:59 +0000 (22:17 -0500)]
5538: Merge FailHandler and FailThenSucceedHandler into one APIStub to facilitate testing many more error states; also add update and delete retry tests.

radhika [Fri, 6 Nov 2015 01:13:32 +0000 (20:13 -0500)]
5538: code improvements; use switch statement instead of if statement with several status code checks, sleep between retries.
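
The general shape of such a retry loop might look like the sketch below. It is written in Python for illustration (the change itself is in the Go SDK), and every name in it is hypothetical. In particular, the set of retryable status codes is an assumption, since the commit message does not list them; the "never retry POST" rule follows another commit on this branch, which tests that POST is not retried.

```python
import time

# Assumption: which statuses count as retryable is not stated in the log.
RETRYABLE_STATUS = {500, 502, 503, 504}


def call_with_retries(method, send, max_tries=3, delay=0.0, sleep=time.sleep):
    """Call send() until it returns a non-retryable status or tries run out.

    `send` is a hypothetical zero-argument callable returning an HTTP status.
    POST requests get exactly one attempt.
    """
    tries = 1 if method == "POST" else max_tries
    status = None
    for attempt in range(tries):
        status = send()
        if status not in RETRYABLE_STATUS:
            break
        if attempt + 1 < tries:
            sleep(delay)  # pause between retries
    return status
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.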

Peter Amstutz [Thu, 5 Nov 2015 19:40:59 +0000 (11:40 -0800)]
5353: Give NodeManagerDaemonActor access to ServerCalculator object.

Tom Clegg [Thu, 5 Nov 2015 19:35:38 +0000 (14:35 -0500)]
7724: Use datamanager token in keepproxy index test. refs #7724

Tom Clegg [Thu, 5 Nov 2015 18:46:20 +0000 (13:46 -0500)]
Merge branch '7724-scoped-token' closes #7724

Brett Smith [Thu, 5 Nov 2015 17:12:17 +0000 (12:12 -0500)]
Fix non-packaged API server paths in the install guide.

No issue #.

Tom Clegg [Thu, 5 Nov 2015 16:37:03 +0000 (11:37 -0500)]
Merge branch '5824-keep-web' refs #5824

Tom Clegg [Thu, 5 Nov 2015 16:33:42 +0000 (11:33 -0500)]
7724: Use a scoped token in data manager tests.

Tom Clegg [Thu, 5 Nov 2015 15:54:13 +0000 (10:54 -0500)]
5824: Use ARVADOS_API_TOKEN=foo + -allow-anonymous instead of -anonymous-token=foo.

Tom Clegg [Thu, 5 Nov 2015 15:11:06 +0000 (10:11 -0500)]
5824: Rename -address to -listen

radhika [Wed, 4 Nov 2015 22:18:25 +0000 (17:18 -0500)]
5538: update the newly added TestFail* to use proper client with http.Transport

radhika [Wed, 4 Nov 2015 22:11:10 +0000 (17:11 -0500)]
Merge branch 'master' into 5538-arvadosclient-retry

Conflicts:
sdk/go/arvadosclient/arvadosclient.go

radhika [Wed, 4 Nov 2015 21:39:59 +0000 (16:39 -0500)]
refs #5538
Merge branch '5538-close-idle-connections'

radhika [Wed, 4 Nov 2015 21:38:28 +0000 (16:38 -0500)]
5538: update test to reuse arvados client in TestCreatePipelineTemplate between idle and current connections.

radhika [Wed, 4 Nov 2015 21:25:32 +0000 (16:25 -0500)]
Merge branch 'master' into 5538-close-idle-connections

radhika [Wed, 4 Nov 2015 21:19:51 +0000 (16:19 -0500)]
closes #7719
Merge branch '7719-permit-net-delete'

radhika [Wed, 4 Nov 2015 21:13:29 +0000 (16:13 -0500)]
7719: permit never-delete to be set to false; add warning that datamanager is not yet fully tested.

radhika [Wed, 4 Nov 2015 19:58:46 +0000 (14:58 -0500)]
5538: add test with a connection idle for longer than MaxIdleConnectionDuration

radhika [Wed, 4 Nov 2015 19:36:42 +0000 (14:36 -0500)]
Merge branch 'master' into 5538-close-idle-connections

Brett Smith [Wed, 4 Nov 2015 19:32:01 +0000 (14:32 -0500)]
Merge branch '7713-node-manager-blacklist-broken-nodes-wip'

Closes #7713, #7718.

radhika [Wed, 4 Nov 2015 19:08:24 +0000 (14:08 -0500)]
5538: using fake arvados server to generate errors, added tests with retries.

Brett Smith [Wed, 4 Nov 2015 17:20:36 +0000 (12:20 -0500)]
7713: Node Manager blackholes broken nodes that can't shut down.

We are seeing situations on Azure where some nodes in an UNKNOWN state
cannot be shut down.  The API call to destroy them always fails.

There are two related halves to this commit.  In the first half,
after a cloud shutdown request fails, ComputeNodeShutdownActor checks
whether the node is broken.  If it is, it cancels shutdown retries.

In the second half, the daemon checks for this shutdown outcome.  When
it happens, it blacklists the broken node: it will immediately filter
it out of node lists from the cloud.  It is no longer monitored in any
way or counted as a live node, so Node Manager will boot a replacement
for it.

This lets Node Manager create cloud nodes above max_nodes, up to the
number of broken nodes.  We're reasonably bounded for now because
only the Azure driver will ever declare a node broken.  Other clouds
will never blacklist nodes this way.
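
The two halves described above can be sketched in a few lines. This is a simplified, hypothetical model and not Node Manager's actor code: the class name, method names, and the use of plain node IDs are assumptions made for illustration.

```python
class CloudNodeTracker:
    """Toy model of the daemon's blacklist for broken nodes."""

    def __init__(self):
        self.blacklist = set()

    def shutdown_finished(self, node_id, success, node_is_broken):
        # Second half: a failed shutdown of a broken node blacklists it.
        if not success and node_is_broken:
            self.blacklist.add(node_id)

    def live_nodes(self, cloud_node_ids):
        # Blacklisted nodes are filtered out of every cloud node listing,
        # so they are no longer counted as live nodes.
        return [n for n in cloud_node_ids if n not in self.blacklist]


def nodes_to_boot(wanted, live_count):
    """Replacement nodes needed once broken nodes stop being counted."""
    return max(0, wanted - live_count)


tracker = CloudNodeTracker()
tracker.shutdown_finished("node-2", success=False, node_is_broken=True)
live = tracker.live_nodes(["node-1", "node-2", "node-3"])
```

Because the broken node still exists in the cloud but is not counted, the real total can exceed the configured maximum by the number of blacklisted nodes, which is the trade-off the commit message calls out.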

radhika [Wed, 4 Nov 2015 16:36:24 +0000 (11:36 -0500)]
Merge branch 'master' into 5538-arvadosclient-retry

radhika [Wed, 4 Nov 2015 16:34:35 +0000 (11:34 -0500)]
5538: close any idle connections before a POST or DELETE request.

radhika [Wed, 4 Nov 2015 15:13:53 +0000 (10:13 -0500)]
5538: retry failed arvados api requests when appropriate.

Tom Clegg [Wed, 4 Nov 2015 05:19:40 +0000 (00:19 -0500)]
Merge branch '7444-dockercleaner-containers' closes #7444

Tom Clegg [Wed, 4 Nov 2015 04:55:11 +0000 (23:55 -0500)]
Merge branch '5824-keep-web'

refs #5824