Tim Pierce [Tue, 6 Jan 2015 16:03:10 +0000 (11:03 -0500)]
4598: account for queued and cancelled jobs, fix sorting
Per code review:
* Updated report to include job states "Cancelled" and "Queued" as well
as Failed, Running and Complete, and to take these into account when
calculating job counts.
* Fixed sorting for failure classes.
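A minimal sketch of the counting described above, not the report's actual code. The `state` and `failure_class` field names, the sample failure class, and the sort order (descending count, then name) are assumptions made for illustration. Each failed job is counted under exactly one class, and "unknown" is used only when no specific class is present.

```python
from collections import Counter

# The five job states named in the commit message.
JOB_STATES = ("Queued", "Running", "Complete", "Failed", "Cancelled")


def summarize(jobs):
    """Return (state_counts, total, failures) for a list of job dicts.

    The dict keys "state" and "failure_class" are hypothetical.
    """
    state_counts = Counter(job["state"] for job in jobs)
    # The total covers every state, not just Failed/Running/Complete.
    total = sum(state_counts[state] for state in JOB_STATES)
    # One class per failed job; fall back to "unknown" only if none is set.
    failure_counts = Counter(
        job.get("failure_class") or "unknown"
        for job in jobs
        if job["state"] == "Failed"
    )
    # Assumed ordering: most frequent class first, ties broken by name.
    failures = sorted(failure_counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return state_counts, total, failures


sample_jobs = [
    {"state": "Failed", "failure_class": "sys/docker"},
    {"state": "Failed", "failure_class": "sys/docker"},
    {"state": "Failed"},
    {"state": "Complete"},
    {"state": "Queued"},
    {"state": "Cancelled"},
]
state_counts, total, failures = summarize(sample_jobs)
percent_failed = 100.0 * state_counts["Failed"] / total
```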
Tim Pierce [Mon, 5 Jan 2015 19:22:47 +0000 (14:22 -0500)]
4598: formatting and calculation fixes (code review)
Incorporating code review feedback from #4598-13.
Bugs fixed:
* Correct counting and percentage calculation of job failures.
** Jobs were getting categorized as both "unknown" and as a specific failure type.
* Crashes fixed: should not raise any unhandled exceptions.
Formatting fixes:
* Itemized failures are now sorted in descending order by failure type.
* Better horizontal alignment.
* Modified formatting to account for updated description.
* crunch-dispatch fetches the requested SDK version into its internal
git repository, just like it does for the Crunch script. Refactored
crunch-dispatch to make that code reusable.
* crunch-job's main script archives the sdk subdirectory as of that
commit, sending it along to compute nodes in the same .tar as the
Crunch script, under .arvados.sdk.
* crunch-job's __DATA__ dispatch section looks for the SDK under
.arvados.sdk, and installs it as much as possible.
Since I was messing with it so much already, I changed the semantics
of crunch-job's __DATA__ section: it is now either in installation
mode or run mode, based on whether there's anything in @ARGV. I
confirmed that this is consistent with current calls to the section.
Brett Smith [Mon, 24 Nov 2014 20:53:00 +0000 (15:53 -0500)]
4027: Revamp SSH use in our Docker images.
* Don't install or run SSH in most of our Docker images. `docker
exec` is now preferred to inspect running images.
* Do run SSH on the API server, always, for Gitolite.
There is a feature regression here: the user's SSH key is not
automatically installed on the shell account. This needs to be fixed
another way. In the meantime, it's not difficult to run
`docker exec -ti --user=self shell /bin/bash`, and you can clone the
repository from the host system.
Tom Clegg [Sun, 7 Dec 2014 23:09:00 +0000 (18:09 -0500)]
Reset listener=nil before running main() from test cases, so
waitForListener() does not get confused by listener!=nil left over
from previous tests. Fixes intermittent test failures.
Tim Pierce [Thu, 4 Dec 2014 16:26:58 +0000 (11:26 -0500)]
4465: test for regex link targets
The goal of this story is that the "report issue" dialog includes links
to a Github or Redmine page corresponding to the software versions for
Workbench and the API server, so the test should ensure not just that
there's a link with a given text, but that its target is a Github page
corresponding to a hexadecimal commit hash.
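The check described above could be sketched as follows. This is an illustration in Python, not the test from this commit; the URL layout, the `commit`/`tree` path segments, and the 7 to 40 character hash length are assumptions.

```python
import re

# Assumed link shape: https://github.com/<org>/<repo>/commit/<hex hash>
COMMIT_LINK = re.compile(
    r"https://github\.com/[\w.-]+/[\w.-]+/(?:commit|tree)/[0-9a-f]{7,40}"
)


def is_commit_link(href):
    """True if href is a GitHub URL ending in a hexadecimal commit hash."""
    return COMMIT_LINK.fullmatch(href) is not None
```

Matching the target against a pattern, rather than comparing only the link text, catches a link that renders correctly but points somewhere else.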
Tim Pierce [Thu, 4 Dec 2014 15:50:44 +0000 (10:50 -0500)]
4465: added api_version_text helper.
Per code review: the source_version returned in the discovery document
may include the string "-modified" if the API server is running from a
locally modified repository. The api_version_link that we generate for
this version must take that into account.
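A rough sketch of the suffix handling, written in Python for illustration rather than taken from the helper itself. The function names and the `base_url` value are hypothetical; only the "-modified" suffix comes from the description above.

```python
MODIFIED_SUFFIX = "-modified"


def split_source_version(source_version):
    """Return (commit, modified) for a source_version string."""
    # Strip surrounding whitespace so a stray newline cannot end up in a URL.
    version = source_version.strip()
    if version.endswith(MODIFIED_SUFFIX):
        return version[: -len(MODIFIED_SUFFIX)], True
    return version, False


def version_link(source_version, base_url="https://example.invalid/commit/"):
    """Build a link to the commit, ignoring any "-modified" suffix."""
    commit, _modified = split_source_version(source_version)
    return base_url + commit
```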
Peter Amstutz [Thu, 4 Dec 2014 14:55:06 +0000 (09:55 -0500)]
Touch the "crunch_refresh_trigger" file when the state changes. This notifies
all crunch-job instances to check the cancelled and state flags, so if a
running job changes state unexpectedly, it will be treated as a cancellation. refs #4314
Refactored some code into a VersionHelper to simplify testing.
Also updated the Rails.configuration.source_version settings for both
API server and Workbench to strip trailing newlines (which were
screwing up the URLs).
Brett Smith [Thu, 27 Nov 2014 02:35:07 +0000 (21:35 -0500)]
4291: Workbench Collection sharing buttons are actual buttons.
This prevents users from trying to open them in new windows/tabs and
getting a 404 response.
I had to rework the pipeline instance comparison JavaScript because it
was disabling the collection share button on page load. All that was
really necessary was making sure the event only fires when there
actually is a form#compare, but I did some other cleanup in the
process of learning that.
Brett Smith [Mon, 1 Dec 2014 16:07:07 +0000 (11:07 -0500)]
4676: Collection sharing popup is always JavaScript.
This fixes an issue where the response would sometimes be sent with
Content-Type: text/html. We thought it might be a race condition with
AJAX, but the browser was sending a correct Accept: header.
Brett Smith [Tue, 25 Nov 2014 22:57:47 +0000 (17:57 -0500)]
4291: Clean up HTTP methods in Workbench URL generators.
According to the docs at
<http://api.rubyonrails.org/files/actionview/lib/action_view/helpers/url_helper_rb.html>:
* `button_to` and `form_for` take :method as a symbol.
* `link_to` takes :method as a symbol, and only supports :delete,
:post, :patch, and :put. Any link that should be done with GET
should not have a method specified.
* Note that `form_tag` *does* take a string, so not every method
should be symbolized.
Brett Smith [Wed, 3 Dec 2014 15:12:40 +0000 (10:12 -0500)]
4705: Fix FUSE exception logging.
logger.exception() doesn't take the exception as an argument; it takes
a message like all the other logger methods. It gets the exception
information from sys.exc_info().
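A small illustration of the calling convention described above. The logger name, the failing `read_block` function, and the path are made up for the example; the logging calls are the standard library's.

```python
import io
import logging

# Hypothetical logger that writes to an in-memory stream.
stream = io.StringIO()
logger = logging.getLogger("fuse_example")
logger.setLevel(logging.ERROR)
logger.addHandler(logging.StreamHandler(stream))
logger.propagate = False


def read_block(path):
    # Stand-in for an operation that fails.
    raise IOError("cannot read {}".format(path))


try:
    read_block("/example/path")
except IOError:
    # Pass a message (with optional format arguments), not the exception.
    # The traceback is picked up from sys.exc_info() automatically.
    logger.exception("Unhandled exception while reading %s", "/example/path")

output = stream.getvalue()
```

Because the traceback comes from sys.exc_info(), logger.exception() is meant to be called from inside an except block.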
Brett Smith [Tue, 2 Dec 2014 15:23:21 +0000 (10:23 -0500)]
4591: Websockets server fetches fewer logs at a time.
Most of the out of memory errors we're seeing happen in the PostgreSQL
driver, which runs out of space to store results. Because Log records
are relatively large (holding two other records as JSON text),
fetching fewer in a batch should noticeably improve memory use. I
don't expect this to end the crashing, though—it seems like the
Websockets server grows large for a variety of reasons. Hopefully
this change will help make some of the others clearer.
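The general idea of fetching in smaller batches can be sketched like this. It is a Python illustration under assumed names, not the Websockets server's code: `fetch_page` is a hypothetical callable standing in for a limited, offset-based database query.

```python
def in_batches(fetch_page, batch_size):
    """Yield lists of at most batch_size rows until fetch_page runs dry.

    fetch_page(offset, limit) is a hypothetical query function.
    """
    offset = 0
    while True:
        rows = fetch_page(offset, batch_size)
        if not rows:
            return
        # Only one batch of results is held in memory at a time.
        yield rows
        offset += len(rows)


# Stand-in data source: 25 fake log rows.
fake_logs = ["log {}".format(i) for i in range(25)]


def fetch_page(offset, limit):
    return fake_logs[offset:offset + limit]


batch_sizes = [len(batch) for batch in in_batches(fetch_page, 10)]
```

A smaller `batch_size` lowers the peak size of any single result set at the cost of more round trips.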
Brett Smith [Tue, 2 Dec 2014 14:59:55 +0000 (09:59 -0500)]
4591: Avoid capturing critical exceptions in Websockets server.
Based on the current logs, the troubles we're hitting in
Websockets happen in push_events, where all the database work
happens. These exceptions wrap PostgreSQL driver errors; they inherit
from StandardError, so they're being caught by the rescue block.
This commit re-raises those exceptions, which will cause the server to
crash (and presumably be restarted by a supervisor like runit).
We do sometimes see NoMemoryError, but the block to catch it is
ineffective because it usually manifests earlier in on_connect, when
the connection is first made. In this case, Ruby's default exception
handling provides the behavior we want, so just remove the block.
In keeping with the theme of improved exception handling, I tightened
up the bad request detection.
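The pattern described above, shown as a Python analogy rather than the server's Ruby code. The exception classes, `handle_message`, and the response shape are all hypothetical: only the narrow, expected error is handled locally, and database errors are re-raised so the process exits and a supervisor can restart it.

```python
class BadRequest(Exception):
    """Hypothetical error for a malformed client message."""


class DatabaseError(Exception):
    """Hypothetical stand-in for a wrapped database driver error."""


def handle_message(message, push_events):
    try:
        push_events(message)
    except BadRequest as error:
        # Expected and recoverable: report it to the client and carry on.
        return {"status": 400, "message": str(error)}
    except DatabaseError:
        # Critical: do not swallow it. Let the process crash and restart.
        raise
    return {"status": 200}


def reject(_message):
    raise BadRequest("invalid filter")


def db_down(_message):
    raise DatabaseError("connection lost")


bad_request_response = handle_message("msg", reject)
try:
    handle_message("msg", db_down)
    database_error_propagated = False
except DatabaseError:
    database_error_propagated = True
```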
Tim Pierce [Tue, 2 Dec 2014 20:32:52 +0000 (15:32 -0500)]
4621: collate_output pipes to python
Rewrote collate_output as create_output_collection, writing its output
data to a Python subprocess that invokes
arvados.api().collections().create(). Writing very large collection
manifests in-process makes Arvados.pm consume inordinate amounts of
memory.
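The piping approach can be sketched in general terms as below. This is an illustration only: here the parent is also Python, and the child merely counts the lines it receives as a stand-in for the collection-creating API call. `CHILD` and `stream_manifest` are hypothetical names.

```python
import subprocess
import sys

# Stand-in child: reads the manifest from stdin and reports a line count.
CHILD = (
    "import sys; "
    "data = sys.stdin.read(); "
    "sys.stdout.write(str(len(data.splitlines())))"
)


def stream_manifest(lines):
    """Write manifest lines to a child process one at a time."""
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD],
        stdin=subprocess.PIPE,
        stdout=subprocess.PIPE,
        universal_newlines=True,
    )
    for line in lines:
        # The parent never holds the whole manifest in memory.
        proc.stdin.write(line + "\n")
    proc.stdin.close()
    result = proc.stdout.read()
    proc.wait()
    return int(result)


line_count = stream_manifest(
    ". 0" * 1 + " 0:0:file{}.txt".format(i) for i in range(3)
)
```

The parent's memory use stays roughly constant because each line is handed to the pipe as soon as it is produced.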