Peter Amstutz [Wed, 7 Jan 2015 19:38:41 +0000 (14:38 -0500)]
4312: Use "install" phase of bootstrap script to report the installed versions
of any arvados pip or Debian packages. Like the virtualenv logic, this only
reports for task 0 (since every task starts the same image).
Brett Smith [Fri, 19 Dec 2014 22:40:13 +0000 (17:40 -0500)]
4836: Trigger Workbench infinite scroll load on tab show.
If an infinite scroller is in the first tab of a show page, but the
user is going to a different tab, we'll queue up the first event
to load data for the container, but when it fires the container won't
be visible so it will decline to load anything. Then you can only get
data to load if you resize the window.
Fire a scroll event when a new tab is shown, to spur the infinite
scroller to load data as appropriate.
Peter Amstutz [Mon, 29 Dec 2014 17:32:38 +0000 (12:32 -0500)]
4869: Correctly handle zero-length blocks in Keep client/Keep proxy. Remove
X-Block-Size. Choose default request timeout based on whether the client is
talking to a proxy. Use double quotes in logging. Rename "tag" to "requestId".
Peter Amstutz [Mon, 29 Dec 2014 14:09:13 +0000 (09:09 -0500)]
4869: KeepClient now has a default timeout per block request (10 minutes). In
keepproxy, the timeout is set to 20 seconds per block. Also rearranged some
keepclient and keepproxy logging to provide better information.
Tom Clegg [Sun, 21 Dec 2014 00:28:56 +0000 (19:28 -0500)]
4875: Let the OS choose port numbers for fake servers.
Fixes a race condition where test case N+1 can't listen on port 2990
because test case N hasn't shut down its listener.
Also removes the artificial acceptance requirement that nobody else on
the testing host is using the arbitrarily assigned port range
2990..299x.
Incidental changes:
* rename RunBogusKeepServer to RunFakeKeepServer (to match
RunSomeFakeKeepServers and fix the misleading implication that the
resulting server does something bogus).
* return a KeepServer object from RunFakeKeepServer (for better parity
with RunSomeFakeKeepServers).
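The port-selection technique described above can be sketched as follows. This is an illustration in Python, not the project's test helper; `listen_on_free_port` is a hypothetical name. It assumes only the standard behavior that binding to port 0 makes the operating system pick an unused port.

```python
import socket


def listen_on_free_port(host="127.0.0.1"):
    """Open a listening TCP socket on a port chosen by the OS.

    Hypothetical helper for illustration only.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Port 0 asks the kernel to assign any currently unused port.
    sock.bind((host, 0))
    sock.listen(1)
    # getsockname() reports the port that was actually assigned.
    port = sock.getsockname()[1]
    return sock, port


if __name__ == "__main__":
    # Two fake servers started back to back never compete for a fixed port.
    first, first_port = listen_on_free_port()
    second, second_port = listen_on_free_port()
    print(first_port, second_port)
    first.close()
    second.close()
```

Because each listener learns its port after binding, a test has to read the port back from the server object rather than assume a fixed range.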
Brett Smith [Fri, 19 Dec 2014 17:09:17 +0000 (12:09 -0500)]
4844: Node Manager doesn't treat min_nodes as min_nodes_idle.
There's a bad interaction between the past bugfixes to (a) implement
min_nodes, and (b) boot new nodes when existing nodes are busy.
Because min_nodes has been implemented at the server wishlist level in
the past, the daemon can't distinguish between "nodes requested to
fulfill min_nodes" and "nodes requested to fulfill jobs."
This commit puts all the responsibility for enforcing min_nodes in the
daemon, so that the server wishlist always represents real job
requirements. This lets the daemon correctly decide whether or not to
boot a new node when >= min_nodes are busy.
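As a rough sketch of the decision described above, assuming the daemon knows how many nodes are up, how many are busy, and how long the wishlist is: the function name and parameters here are hypothetical and are not taken from the Node Manager code.

```python
def nodes_to_boot(min_nodes, up_count, busy_count, wishlist_size):
    """Return how many new nodes to boot (hypothetical helper).

    The wishlist is assumed to hold only real job requirements, so
    min_nodes is applied here as a floor on the total node count.
    """
    # Nodes needed for real work: those already busy plus one per wishlist entry.
    wanted_for_jobs = busy_count + wishlist_size
    # min_nodes is a floor on total nodes, not a number of idle nodes to keep.
    target = max(min_nodes, wanted_for_jobs)
    return max(0, target - up_count)


if __name__ == "__main__":
    # min_nodes=2 and both nodes busy, nothing queued: no extra node is booted.
    print(nodes_to_boot(min_nodes=2, up_count=2, busy_count=2, wishlist_size=0))
    # Same state with one job waiting: one node is booted.
    print(nodes_to_boot(min_nodes=2, up_count=2, busy_count=2, wishlist_size=1))
```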
Brett Smith [Fri, 12 Dec 2014 21:16:39 +0000 (16:16 -0500)]
4670: Add a post-create hook to Node Manager for EC2 tagging.
The previous code was relying on the post-create tagging in libcloud's
EC2 driver. Unfortunately, that's not working out too well for us: if
it fails, you get no indication of that, and it doesn't get retried.
This moves the work up into Node Manager, where failures can be logged
and retried appropriately.
The retry support may be sufficient to resolve #4670. If it's not,
then the additional logging will help us track down the root cause.
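The log-and-retry behavior described above might look roughly like this. It is a minimal sketch, not Node Manager's actual hook: `run_with_retries`, its parameters, and the logger name are all assumptions made for illustration.

```python
import logging
import time

logger = logging.getLogger("nodemanager.sketch")


def run_with_retries(action, attempts=3, delay=0.0):
    """Call action() until it succeeds, logging each failure (hypothetical)."""
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as error:
            # Unlike a silent post-create step, every failure is recorded.
            # A real implementation would likely catch a narrower set of errors.
            logger.warning("post-create attempt %d/%d failed: %s",
                           attempt, attempts, error)
            if attempt == attempts:
                # Out of retries: let the caller see the last error.
                raise
            time.sleep(delay)
```

A caller would wrap the tagging step, for example `run_with_retries(lambda: tag_node(node))`, where `tag_node` stands for whatever function applies the tags.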
Brett Smith [Fri, 12 Dec 2014 18:18:51 +0000 (13:18 -0500)]
4670: Node Manager handles more libcloud exceptions.
libcloud compute drivers (at least EC2 and GCE) raise bare Exceptions
when there's some problem talking to the cloud service. The previous
code was expecting to see a LibcloudError, so it wouldn't handle these
errors as intended.
I didn't want to just catch errors with "except Exception" everywhere,
so I added an is_cloud_exception class method to our driver classes to
more accurately identify exceptions that represent trouble talking to
the cloud service. It recognizes exact Exceptions, plus the other
classes we were catching before.
While I was at this, I gave more specific names to the wrapper methods
in compute node actor decorators, as a debugging aid.
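A check along the lines described above could be sketched like this. Only the method name `is_cloud_exception` comes from the commit message; the `SketchDriver` class, the `CLOUD_ERRORS` tuple, and the stand-in `LibcloudError` are assumptions so the example runs without libcloud installed.

```python
class LibcloudError(Exception):
    """Stand-in for libcloud's error class (assumption, for a runnable sketch)."""


class SketchDriver(object):
    # Exception classes treated as cloud trouble, besides a bare Exception.
    CLOUD_ERRORS = (LibcloudError,)

    @classmethod
    def is_cloud_exception(cls, exception):
        # "type(...) is Exception" matches only an exact Exception instance,
        # not subclasses such as ValueError or KeyError.
        return (type(exception) is Exception or
                isinstance(exception, cls.CLOUD_ERRORS))


def call_cloud(driver_class, action):
    """Return None on cloud trouble; re-raise anything else (hypothetical)."""
    try:
        return action()
    except Exception as error:
        if not driver_class.is_cloud_exception(error):
            raise
        return None
```

The point of the exact-type test is that programming errors still propagate, while the bare Exceptions raised by the cloud drivers are handled as service trouble.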
Brett Smith [Wed, 10 Dec 2014 21:40:13 +0000 (16:40 -0500)]
4481: Refresh Crunch script tutorial page.
* The script now normalizes the output path, both for consistency with
other scripts and because it looks nicer.
* Modernize the job log output slightly, and adjust text to match.
Brett Smith [Fri, 5 Dec 2014 22:45:13 +0000 (17:45 -0500)]
4380: Node Manager SLURM dispatcher proceeds from more states.
Per discussion with Ward. Our main concern is that Node Manager
shouldn't shut down nodes that are doing work. We feel comfortable
broadening the definition of "not doing work" to this set of states.
* crunch-dispatch fetches the requested SDK version into its internal
git repository, just like it does for the Crunch script. Refactored
crunch-dispatch to make that code reusable.
* crunch-job's main script archives the sdk subdirectory as of that
commit, sending it along to compute nodes in the same .tar as the
Crunch script, under .arvados.sdk.
* crunch-job's __DATA__ dispatch section looks for the SDK under
.arvados.sdk, and installs it as much as possible.
Since I was messing with it so much already, I changed the semantics
of crunch-job's __DATA__ section: it is now either in installation
mode or run mode, based on whether there's anything in @ARGV. I
confirmed that this is consistent with current calls to the section.
Brett Smith [Mon, 24 Nov 2014 20:53:00 +0000 (15:53 -0500)]
4027: Revamp SSH use in our Docker images.
* Don't install or run SSH in most of our Docker images. `docker exec`
is now preferred for inspecting running containers.
* Do run SSH on the API server, always, for Gitolite.
There is a feature regression here: the user's SSH key is not
automatically installed on the shell account. This needs to be fixed
another way. In the meantime, it's not difficult to run
`docker exec -ti --user=self shell /bin/bash`, and you can clone the
repository from the host system.