Peter Amstutz [Thu, 11 Jun 2015 17:12:21 +0000 (13:12 -0400)]
3198: arvfile.py and collection.py updates based on feedback.
* Add retryable ERROR state to buffer blocks.
* Fix race conditions when testing a bufferblock's state before trying to change it (see the sketch after this list).
* Change "modified" to "committed" (which has the opposite meaning, but is more
accurate the way the flag is used)
* Refactor block manager background upload/download threads
* Fix Collection.mkdirs() to match behavior of os.makedirs()
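The race-condition fix comes down to doing the "check the state" and "change the state" steps under a single lock. A minimal sketch of that pattern, using hypothetical names (BufferBlock, set_state, the four states) that illustrate the idea rather than reproduce the actual arvfile.py code:

```python
import threading

class StateError(Exception):
    pass

class BufferBlock:
    # Hypothetical states matching the description above: a block is WRITABLE
    # until an upload starts, PENDING while the upload is in flight, then
    # either COMMITTED (success) or ERROR (failed, eligible for retry).
    WRITABLE, PENDING, COMMITTED, ERROR = range(4)

    def __init__(self):
        self._state = self.WRITABLE
        self._lock = threading.Lock()

    def set_state(self, new_state, expect=None):
        # Test-and-set under one lock, so checking the current state and
        # changing it cannot interleave with another thread's transition.
        with self._lock:
            if expect is not None and self._state != expect:
                raise StateError("cannot move to %s from %s"
                                 % (new_state, self._state))
            self._state = new_state

# An uploader thread claims a block with
#     block.set_state(BufferBlock.PENDING, expect=BufferBlock.WRITABLE)
# and marks a failed upload ERROR so it can be retried later.
```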
Nico Cesar [Tue, 9 Jun 2015 18:48:29 +0000 (14:48 -0400)]
qr1hi-automated-performance-suite is failing because the test doesn't give the page enough time to render; raising the timeout to 50s makes the test pass. No issue #
Brett Smith [Wed, 3 Jun 2015 20:59:24 +0000 (16:59 -0400)]
5790: Improve Docker image listing in Python SDK.
* Always fetch all relevant Docker links.
* Support finding images by image hash.
* Show image hashes when listing images by name.
* Like Docker itself, when an image has multiple names and we're not
filtering by name, list each one.
* Better match the API server's priority logic:
* Ignore links to collections that aren't found.
* Links with an image_timestamp always have priority over those that
don't, regardless of their respective created_at timestamps.
The main motivation for this change is to make sure arv-copy gets the
right Docker image when copying a pipeline template recursively.
This implementation goes through some trouble to parse timestamps out
of each Docker link only once.
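A rough sketch of the parse-once idea, built around the image_timestamp and created_at fields mentioned above; the function names and the timestamp format are assumptions for illustration, not the SDK's actual code:

```python
import time

def _parse(ts):
    # Hypothetical parser; the real SDK's timestamp handling may differ.
    return time.strptime(ts, "%Y-%m-%dT%H:%M:%SZ") if ts else None

def prioritized(links):
    # Parse the timestamps out of each Docker link exactly once, then sort so
    # that links carrying an image_timestamp always outrank links that only
    # have created_at, regardless of how the created_at values compare.
    keyed = []
    for link in links:
        image_ts = _parse(link.get("properties", {}).get("image_timestamp"))
        created = _parse(link["created_at"])
        keyed.append(((image_ts is not None, image_ts or created), link))
    # Newest first within each group; the boolean puts image_timestamp links
    # ahead of the rest.
    keyed.sort(key=lambda kv: kv[0], reverse=True)
    return [link for _, link in keyed]
```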
Tom Clegg [Thu, 4 Jun 2015 04:52:48 +0000 (00:52 -0400)]
6074: Use each instead of find_each, so our order() and limit() constraints are respected.
According to http://apidock.com/rails/ActiveRecord/Batches/find_each,
both order and limit are ignored.
The existing test "max_index_database_read does not interfere with
order" had two fatal bugs that prevented it from catching the
find_each problem:
1. It didn't select the 'name' column, so the 'name' order was
ignored. The test passed anyway because the name wasn't returned:
item['name'] was nil, and nil !~ /pattern/ is true. Both
problems are fixed here.
(This explains why it seemed to find 15 names starting with
Collection_9, rather than just the 11 that exist (_9 and _90..99).)
2. It tested only that the returned results followed the requested
order, not that the order was followed when deciding what the limit
should be. All of the items_available were the same size, so _any_
order would have set the limit at 15 and passed the test.
The second problem is fixed by adding a separate test.
Tom Clegg [Wed, 3 Jun 2015 20:28:01 +0000 (16:28 -0400)]
6074: Clear any existing ActiveRecord select() before adding our own,
otherwise we'll read all the big values when we're really just trying
to predict the result size.
Tom Clegg [Mon, 25 May 2015 14:35:02 +0000 (10:35 -0400)]
6087: Use HTTPClient's compression feature (instead of adding the
Content-Encoding header ourselves). Rename config knob to describe
purpose instead of implementation.
Tom Clegg [Sun, 31 May 2015 06:02:19 +0000 (02:02 -0400)]
6146: Improvements to "kill srun process if slurm task disappears" feature:
* Log when we notice a process is orphaned.
* Log when we decide to kill an orphaned process.
* Use `squeue --jobs $SLURM_JOBID` so slurm doesn't have to tell us
about other jobs' tasks.
* Do not kill a process that is still reporting stderr.
* Do not check `squeue` at all if every process has reported stderr
since the last squeue check. (In such cases, it seems safe to assume
no children are hung/dead.)
* Use the same timer/interval (15 seconds) for both noticing and
killing orphaned processes (see the sketch below).
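A Python sketch of the behavior described in the list above, not the actual implementation; the squeue output format and the surrounding bookkeeping are assumptions:

```python
import subprocess
import time

SQUEUE_INTERVAL = 15  # one timer for both noticing and killing orphans

def squeue_task_names(slurm_jobid):
    # Ask only about our own job so slurm doesn't have to report every other
    # job's tasks. The --format string is an assumption for this sketch.
    out = subprocess.check_output(
        ["squeue", "--jobs", str(slurm_jobid), "--noheader", "--format=%j"])
    return set(out.decode().split())

def check_for_orphans(procs, slurm_jobid, last_stderr, last_check):
    # procs maps a task name to its local srun subprocess.Popen object;
    # last_stderr maps a task name to the time it last wrote to stderr.
    now = time.time()
    if now - last_check < SQUEUE_INTERVAL:
        return last_check
    if all(t > last_check for t in last_stderr.values()):
        # Every child has written stderr since the last check, so assume
        # nothing is hung or dead and skip squeue entirely.
        return now
    alive = squeue_task_names(slurm_jobid)
    for name, proc in procs.items():
        if name in alive or now - last_stderr.get(name, 0) < SQUEUE_INTERVAL:
            continue  # still known to slurm, or still reporting stderr
        print("task %s is gone from squeue and silent; killing srun" % name)
        proc.kill()
    return now
```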
Tom Clegg [Thu, 28 May 2015 21:13:44 +0000 (17:13 -0400)]
6146: Exit TEMPFAIL early (without failing the job) if worker nodes cannot run a trivial command.
This is meant to improve the way we handle a couple of edge cases.
1. A worker node doesn't get bootstrapped properly. It works well
enough to persuade nodemanager and the API server that it's alive and
ready to run jobs, but it can't actually run jobs. This means there's
a bug in the bootstrapping process -- its startup script shouldn't
tell slurm State=RESUME without checking this itself -- but even so it
doesn't deserve to fail the job: it's clearly a system problem, and
there's zero chance a different job would have gone any differently.
2. A worker node has a hardware problem, or it has fallen off the
network, or something like that, but slurm hasn't yet noticed and set
its state to DOWN, so slurm still uses it to satisfy crunch-dispatch's
"salloc" commands. As above, there's zero chance this could have gone
differently for any other job, so it doesn't make sense to fail the
job.
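A minimal sketch of such a pre-flight check, assuming srun is how each node gets exercised and that exit code 75 (EX_TEMPFAIL from sysexits.h) is what tells the dispatcher to requeue rather than fail; the names here are illustrative, not the actual implementation:

```python
import subprocess
import sys

EX_TEMPFAIL = 75  # sysexits.h temporary-failure code (assumed mapping)

def nodes_can_run_trivial_command(nodes):
    # Run a trivial command on every assigned worker node before starting the
    # real work. Any failure here is a system problem, not the job's fault.
    for node in nodes:
        rc = subprocess.call(["srun", "--nodes=1", "--nodelist=" + node, "true"])
        if rc != 0:
            print("worker node %s cannot run a trivial command" % node,
                  file=sys.stderr)
            return False
    return True

# if not nodes_can_run_trivial_command(assigned_nodes):
#     sys.exit(EX_TEMPFAIL)  # give up early without recording a job failure
```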
Peter Amstutz [Fri, 22 May 2015 20:13:22 +0000 (16:13 -0400)]
6141: Remove the hard-coded "https://" from "https://{{site.arvados_workbench_host}}" and require that arvados_workbench_host include the URL scheme instead.