Joshua C. Randall [Thu, 18 Feb 2016 00:02:06 +0000 (00:02 +0000)]
Adds sanity check on number of collections retrieved
Changes GetCollections() to return an error if the number
of collections retrieved is less than the initial number
of collections.
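A minimal sketch of the kind of check described above, written in Python rather than the Go of the actual change. The names `check_collection_count` and `CollectionCountError` are hypothetical and are not taken from the Arvados source; the commit only says that an error is returned when fewer collections are retrieved than the initial count.

```python
class CollectionCountError(Exception):
    """Raised when fewer collections come back than were initially counted."""


def check_collection_count(initial_count, retrieved):
    # Hypothetical helper: fail loudly instead of silently returning a short list.
    if len(retrieved) < initial_count:
        raise CollectionCountError(
            "retrieved %d collections, expected at least %d"
            % (len(retrieved), initial_count))
    return retrieved
```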
radhika [Wed, 17 Feb 2016 14:32:09 +0000 (09:32 -0500)]
closes #8286
Merge branch '8286-fav-projects'
radhika [Wed, 17 Feb 2016 14:30:58 +0000 (09:30 -0500)]
8286: bigger star icon with stronger color contrast.
radhika [Tue, 16 Feb 2016 17:00:09 +0000 (12:00 -0500)]
8286: also include a link to Home in the favorites section of the Projects dropdown, so that the user does not have to scroll through all of their favorites to find it in the MyProjects section.
radhika [Tue, 16 Feb 2016 02:33:01 +0000 (21:33 -0500)]
Merge branch 'master' into 8286-fav-projects
radhika [Tue, 16 Feb 2016 02:32:00 +0000 (21:32 -0500)]
closes #8079
Merge branch '8079-api-client-auth-uuid'
radhika [Tue, 16 Feb 2016 01:46:05 +0000 (20:46 -0500)]
Merge branch 'master' into 8079-api-client-auth-uuid
radhika [Tue, 16 Feb 2016 01:44:59 +0000 (20:44 -0500)]
8286: check if project is starred only when current_user is not null (anonymous user case).
radhika [Mon, 15 Feb 2016 22:28:55 +0000 (17:28 -0500)]
8286: to facilitate in-place star icon refresh without the whole page refresh, it became necessary
to refresh the Projects menu after star/unstar action. Hence, moved the breadcrumbs code from
body.html into a partial.
radhika [Mon, 15 Feb 2016 20:26:05 +0000 (15:26 -0500)]
8286: convert star method into action controller action and refresh the star icon in place rather than a full page refresh.
radhika [Mon, 15 Feb 2016 18:50:48 +0000 (13:50 -0500)]
Merge branch 'master' into 8286-fav-projects
radhika [Mon, 15 Feb 2016 18:43:08 +0000 (13:43 -0500)]
closes #8183
Merge branch '8183-projects-dropdown'
radhika [Mon, 15 Feb 2016 18:42:44 +0000 (13:42 -0500)]
Merge branch 'master' into 8183-projects-dropdown
radhika [Mon, 15 Feb 2016 18:33:43 +0000 (13:33 -0500)]
8286: include favorites and top-level my projects in projects dropdown.
Tom Clegg [Mon, 15 Feb 2016 18:22:03 +0000 (13:22 -0500)]
8409: Use 80% utilization as keep_cache_mb_per_task reporting threshold. refs #8409
radhika [Mon, 15 Feb 2016 17:47:22 +0000 (12:47 -0500)]
8079: add rescue to drop index
radhika [Mon, 15 Feb 2016 17:35:02 +0000 (12:35 -0500)]
Merge branch '8183-projects-dropdown' into 8286-fav-projects
Conflicts:
apps/workbench/app/controllers/application_controller.rb
apps/workbench/app/views/application/_projects_tree_menu.html.erb
apps/workbench/test/controllers/projects_controller_test.rb
radhika [Mon, 15 Feb 2016 16:29:04 +0000 (11:29 -0500)]
Merge branch 'master' into 8183-projects-dropdown
Tom Clegg [Mon, 15 Feb 2016 15:49:39 +0000 (10:49 -0500)]
Merge branch '8178-trash-interface-generic-volume-test' closes #8178
Tom Clegg [Thu, 4 Feb 2016 23:28:43 +0000 (18:28 -0500)]
8178: Stop accepting zeroed data, now that the s3test bug is fixed.
radhika [Wed, 27 Jan 2016 04:31:27 +0000 (23:31 -0500)]
8178: add generic volume tests for trash / untrash interface.
radhika [Mon, 15 Feb 2016 13:40:54 +0000 (08:40 -0500)]
Merge branch 'master' into 8079-api-client-auth-uuid
radhika [Mon, 15 Feb 2016 13:36:47 +0000 (08:36 -0500)]
8183: display top three levels of projects in menu, improve message when there are too many projects, improve the test.
Tom Clegg [Thu, 11 Feb 2016 21:25:54 +0000 (16:25 -0500)]
Process live logs for unfinished jobs in pipeline mode, too.
No issue #
radhika [Sat, 13 Feb 2016 01:57:25 +0000 (20:57 -0500)]
8183: retrieve only 3 levels of projects while building projects dropdown.
Brett Smith [Fri, 12 Feb 2016 23:38:36 +0000 (18:38 -0500)]
8203: crunch-job tempfails after failing to install a Docker image.
* Use `bash -o pipefail` to run the image installer shell script, to
reliably detect more failures.
* Exit EX_RETRY_UNLOCKED after a failure to install the image.
Refs #8203, because I'm cautiously optimistic that this will reduce
the incidence of the "can't find UID" problem.
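Why `bash -o pipefail` detects more failures can be illustrated with a small sketch. It assumes `bash` is on the PATH; the pipeline `false | cat` is only a stand-in, not the real image installer script, and `run_pipeline` is a hypothetical helper.

```python
import subprocess

# Stand-in pipeline: the first stage fails, the last stage succeeds.
PIPELINE = "false | cat"


def run_pipeline(pipefail):
    """Run PIPELINE under bash and return its exit status."""
    argv = ["bash"]
    if pipefail:
        argv += ["-o", "pipefail"]
    argv += ["-c", PIPELINE]
    return subprocess.call(argv)
```

Without `pipefail`, bash reports only the status of the last command in the pipeline, so an earlier failure goes unnoticed; with it, the failing stage's non-zero status is what the caller sees.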
radhika [Fri, 12 Feb 2016 21:50:34 +0000 (16:50 -0500)]
8183: while displaying "my projects" tree, show only the user's projects and omit shared projects,
even when there are no more than 300 projects, to avoid confusion.
radhika [Fri, 12 Feb 2016 21:24:41 +0000 (16:24 -0500)]
Merge branch 'master' into 8183-projects-dropdown
radhika [Fri, 12 Feb 2016 19:03:57 +0000 (14:03 -0500)]
8286: add an integration test to star / unstar project by clicking on the icon.
radhika [Thu, 11 Feb 2016 19:53:31 +0000 (14:53 -0500)]
8286: added test "unshare project and verify that it is no longer included in shared user's starred projects"
Peter Amstutz [Thu, 11 Feb 2016 14:41:17 +0000 (09:41 -0500)]
Merge branch '8409-report-keep-cache' closes #8409
Peter Amstutz [Thu, 11 Feb 2016 14:27:21 +0000 (09:27 -0500)]
8409: Adjust recommended miss rate to below 0.2%
Peter Amstutz [Wed, 10 Feb 2016 21:42:02 +0000 (16:42 -0500)]
8409: Add recommendation if cache miss rate is above 0.5%. Fix tests.
Peter Amstutz [Wed, 10 Feb 2016 20:33:41 +0000 (15:33 -0500)]
8409: Report Keep cache miss rate & Keep cache utilization
Brett Smith [Wed, 10 Feb 2016 20:02:48 +0000 (15:02 -0500)]
Build python-gflags backports < version 3.0.
No issue #.
radhika [Wed, 10 Feb 2016 18:55:40 +0000 (13:55 -0500)]
Merge branch 'master' into 8286-fav-projects
radhika [Wed, 10 Feb 2016 18:51:20 +0000 (13:51 -0500)]
8079: add down migration to api_client_authorizations_search_index
Brett Smith [Wed, 10 Feb 2016 18:19:11 +0000 (13:19 -0500)]
Pin PySDK's gflags dependency to <3.0.
We've built and tested with 3.0 successfully, but its ChangeLog says:
* A lot of potentially backwards incompatible changes since 2.0.
* This version is NOT recommended to use in production. Some of the files and
documentation has been lost during export; this will be fixed in next
versions.
We found out about this after 3.0.2 broke our tests.
Take their advice for now. No issue #.
radhika [Wed, 10 Feb 2016 17:55:22 +0000 (12:55 -0500)]
Merge branch 'master' into 8286-fav-projects
Tom Clegg [Wed, 10 Feb 2016 17:52:08 +0000 (12:52 -0500)]
Merge branch '8341-crunchstat-job-time-axis'
radhika [Wed, 10 Feb 2016 17:23:50 +0000 (12:23 -0500)]
Merge branch 'master' into 8079-api-client-auth-uuid
radhika [Wed, 10 Feb 2016 17:19:23 +0000 (12:19 -0500)]
8079: Added support for get using uuid and list using uuid or api_token, and added tests.
Tom Clegg [Wed, 10 Feb 2016 17:10:53 +0000 (12:10 -0500)]
8341: Fall back to live logs if log collection is saved but missing.
Tom Clegg [Wed, 10 Feb 2016 16:14:54 +0000 (11:14 -0500)]
8341: Update test results.
Tom Clegg [Wed, 10 Feb 2016 03:44:02 +0000 (22:44 -0500)]
8341: Retrieve only the log attributes that actually get used.
Tom Clegg [Tue, 9 Feb 2016 18:53:38 +0000 (13:53 -0500)]
8341: In pipeline mode, process all jobs concurrently.
Tom Clegg [Tue, 9 Feb 2016 16:49:35 +0000 (11:49 -0500)]
8341: Include Keep network activity in net stats.
Tom Clegg [Tue, 9 Feb 2016 15:27:47 +0000 (10:27 -0500)]
8341: Fix up debug labels. Avoid deadlock after exceptions in thread.
Tom Clegg [Mon, 8 Feb 2016 20:47:02 +0000 (15:47 -0500)]
8341: Do not round up Y axis to even numbers, just use max series value.
Remove Y axis labels (so X axis matches other graphs from the same
job), add grid lines.
Tom Clegg [Mon, 8 Feb 2016 14:56:10 +0000 (09:56 -0500)]
8341: Use "time since job start", not "time since task start", as X axis.
Tom Clegg [Wed, 10 Feb 2016 15:51:33 +0000 (10:51 -0500)]
Merge branch '8284-fix-slurm-queue-timestamp-check' closes #8284
Tom Clegg [Wed, 10 Feb 2016 15:22:07 +0000 (10:22 -0500)]
Emit log when installing docker image.
Avoids creating the illusion that "clean work dirs" is taking forever.
No issue #
Tom Clegg [Wed, 10 Feb 2016 15:08:57 +0000 (10:08 -0500)]
Merge branch '7263-better-busy-behavior' refs #7263
Tom Clegg [Fri, 22 Jan 2016 20:02:21 +0000 (15:02 -0500)]
7263: Avoid getting stuck processing stderr for one task for a long time.
Do not sleep(0.1) unless pipes are idle.
Brett Smith [Tue, 9 Feb 2016 22:10:08 +0000 (17:10 -0500)]
crunch-job detects more "io aborted" SLURM errors.
It's seemingly random whether SLURM reports "Aborting, io aborted and
missing step" or "Aborting, missing step and io aborted". Extend the
regexp to catch both. No issue #.
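A sketch of a pattern that accepts both word orders, in Python. The real crunch-job regexp is written in Perl and is not shown in this log, so the pattern below is an assumption built only from the two message variants quoted above; `is_io_aborted` is a hypothetical helper.

```python
import re

# Accept either ordering of the two phrases SLURM may report.
IO_ABORTED = re.compile(
    r"Aborting, "
    r"(?:io aborted and missing step|missing step and io aborted)")


def is_io_aborted(line):
    """Return True if the log line reports either variant of the error."""
    return IO_ABORTED.search(line) is not None
```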
Brett Smith [Tue, 9 Feb 2016 21:42:59 +0000 (16:42 -0500)]
Merge branch '8406-tempfail-after-retry-unlocked'
Closes #8406, #8407.
Brett Smith [Tue, 9 Feb 2016 21:42:12 +0000 (16:42 -0500)]
8406: Update comment to match new code.
Peter Amstutz [Tue, 9 Feb 2016 21:25:45 +0000 (16:25 -0500)]
8406: @job_retry_counts.include? jobrecord.uuid because @job_retry_counts has a default value.
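The reason a membership test is needed can be shown with a loose Python analogy. The original code is Ruby; the default value of 0 and the names `retry_counts` and `seen_before` are assumptions for illustration only.

```python
from collections import defaultdict

# A mapping with a default value: missing keys read as 0 instead of failing.
retry_counts = defaultdict(int)
retry_counts["job-a"] += 1


def seen_before(uuid):
    # A membership test does not trigger the default, so it can tell
    # "never recorded" apart from "recorded".
    return uuid in retry_counts
```

Looking a key up directly always yields a value when a default is set, so it cannot answer whether the job was recorded at all. Unlike a Ruby hash created with a default, a Python `defaultdict` also inserts the key on lookup, so the analogy holds only for the membership test.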
Peter Amstutz [Tue, 9 Feb 2016 20:53:13 +0000 (15:53 -0500)]
8406: Treat EXIT_TEMPFAIL as EXIT_RETRY_UNLOCKED if we have previously gotten
EXIT_RETRY_UNLOCKED (because the job is now in "Running" state.)
Nico Cesar [Tue, 9 Feb 2016 21:34:34 +0000 (16:34 -0500)]
I "fonud" a typo
no issue #
Peter Amstutz [Tue, 9 Feb 2016 17:32:42 +0000 (12:32 -0500)]
Merge branch '8404-catch-interrupted-syscall' closes #8404
Peter Amstutz [Tue, 9 Feb 2016 17:31:07 +0000 (12:31 -0500)]
8404: Adjust try block to just surround os.wait().
Peter Amstutz [Tue, 9 Feb 2016 16:41:13 +0000 (11:41 -0500)]
8404: catch and continue from interrupted system call from os.wait()
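A minimal sketch of catching and continuing from an interrupted `os.wait()`. The helper name `wait_retrying_on_eintr` and its injectable `wait` parameter are hypothetical, added so the behavior can be exercised without child processes. On Python 3.5 and later the interpreter already retries most interrupted system calls itself, so this pattern matters mainly for older interpreters.

```python
import errno
import os


def wait_retrying_on_eintr(wait=os.wait):
    """Call wait() until it finishes without being interrupted by a signal."""
    while True:
        try:
            # Keep the try block narrow: only the wait call is retried.
            return wait()
        except OSError as e:
            if e.errno != errno.EINTR:
                raise
```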
radhika [Tue, 9 Feb 2016 16:11:59 +0000 (11:11 -0500)]
8079: add uuid to api_client_authorizations_search_index and add uuid to all api_client_authorizations test fixtures.
radhika [Tue, 9 Feb 2016 14:57:15 +0000 (09:57 -0500)]
8079: update the migration script to use the api_token as the seed
radhika [Tue, 9 Feb 2016 13:57:50 +0000 (08:57 -0500)]
8079: add uuid to api_client_authorizations
Tom Clegg [Mon, 8 Feb 2016 21:09:28 +0000 (16:09 -0500)]
Fix nodemanager test race. No issue #
Tom Clegg [Mon, 8 Feb 2016 19:32:45 +0000 (14:32 -0500)]
Merge branch '8341-live-crunchstat-summary' refs #8341
Tom Clegg [Mon, 8 Feb 2016 19:18:42 +0000 (14:18 -0500)]
8341: Use a Queue of lines and one thread, instead of a succession of threads and a deque of buffers.
Tom Clegg [Mon, 8 Feb 2016 01:19:45 +0000 (20:19 -0500)]
8341: Move reader classes to reader.py.
Tom Clegg [Mon, 8 Feb 2016 01:15:00 +0000 (20:15 -0500)]
8341: Use a worker thread to get page N+1 of logs while parsing page N.
Tom Clegg [Mon, 8 Feb 2016 00:43:02 +0000 (19:43 -0500)]
8341: Get job log from logs API if the log has not been written to Keep yet.
Tom Clegg [Mon, 8 Feb 2016 19:29:03 +0000 (14:29 -0500)]
Merge branch '8289-no-extra-orders' closes #8289
Tom Clegg [Mon, 8 Feb 2016 19:28:02 +0000 (14:28 -0500)]
8289: Strip redundant orders, even when provided explicitly by client.
Tom Clegg [Sat, 23 Jan 2016 05:23:49 +0000 (00:23 -0500)]
8289: Do not add fallback orders if client already specified an unambiguous order.
Peter Amstutz [Mon, 8 Feb 2016 16:28:53 +0000 (11:28 -0500)]
Merge branch '7667-node-manager-logging' refs #7667
Peter Amstutz [Mon, 8 Feb 2016 16:28:11 +0000 (11:28 -0500)]
7667: Store node size in a table to avoid blocking on booting and shutdown
actors to ask for node size.
Peter Amstutz [Mon, 8 Feb 2016 03:52:51 +0000 (22:52 -0500)]
7667: Fix log message
Tom Clegg [Sat, 6 Feb 2016 00:45:30 +0000 (19:45 -0500)]
Merge branch '8285-fuse-subscribe-websockets' closes #8285
Tom Clegg [Sat, 6 Feb 2016 00:39:42 +0000 (19:39 -0500)]
8285: Test that arvados.events.subscribe() is called only when needed.
Add missing TagsDirectory.want_event_subscribe().
Peter Amstutz [Sat, 6 Feb 2016 00:17:42 +0000 (19:17 -0500)]
8285: Add test for listen_for_events
radhika [Fri, 5 Feb 2016 23:25:11 +0000 (18:25 -0500)]
8183: add test to check build of my projects tree with the new method; update the method implementation
to accommodate testing by making page_size an argument.
Peter Amstutz [Fri, 5 Feb 2016 21:39:25 +0000 (16:39 -0500)]
8285: Add want_event_subscribe flag to subclasses of fusedir.Directory,
determine whether to call listen_for_events based on it.
radhika [Fri, 5 Feb 2016 17:00:58 +0000 (12:00 -0500)]
8183: When there are more than 200 readable projects, build the tree in steps;
fetch projects under home, then subprojects under those projects and so on
until we exceed the 200 limit. In that case, also display a message in the Projects
dropdown informing the user that not all projects were retrieved.
The chooser version of the project tree is untouched (and hence not performing well)
until the favorite projects implementation is in place. This is because
there is no search capability in the chooser dialog and hence no way to choose a
shared project when the limit is exceeded in this new top-down approach.
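The top-down approach described above can be sketched as a level-by-level walk. This is Python, not the Workbench Ruby code, and `build_tree_in_steps` and `children_of` are hypothetical names; only the limit of 200 and the level-by-level strategy come from the commit message.

```python
def build_tree_in_steps(home, children_of, limit=200):
    """Collect projects one level at a time, starting under `home`.

    `children_of(project)` is assumed to return that project's direct
    subprojects. Returns (projects, truncated); truncated is True when
    the limit was exceeded and deeper levels were not fetched.
    """
    collected = []
    level = [home]
    while level:
        next_level = []
        for parent in level:
            next_level.extend(children_of(parent))
        if not next_level:
            break
        collected.extend(next_level)
        if len(collected) > limit:
            # Stop descending; the caller can show a "not all projects" notice.
            return collected, True
        level = next_level
    return collected, False
```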
Peter Amstutz [Fri, 5 Feb 2016 16:10:43 +0000 (11:10 -0500)]
7667: Combine polling logs into fewer lines for less noise. Adjust message
when last_ping_at is unexpectedly none to be less severe (can happen in
innocent circumstances). Report nodes in "booted" list as "booting" since they
are unpaired. Fix tests.
Brett Smith [Fri, 5 Feb 2016 09:52:43 +0000 (04:52 -0500)]
7868: Update API server's arvados-cli version.
Curoverse clusters are deployed by setting CRUNCH_JOB_BIN,
effectively excluding it from the bundle, but this is not true for
clusters deployed following the install guide. Out of the box,
they'll use the version of crunch-job that's actually in the
arvados-cli gem in the bundle.
crunch-dispatch has functionality in it that requires a newer
arvados-cli, so update accordingly. This is not exactly the version
produced by #7868, but it's pretty close.
I think there's a strong case that we should update this version
whenever we make a substantial change to crunch-job. But since I'm
pushing this without discussion or review, I'm doing the smallest
thing possible.
Refs #7868.
Peter Amstutz [Thu, 4 Feb 2016 23:46:31 +0000 (18:46 -0500)]
7667: Node manager bug fixes and logging improvements.
* ComputeNodeSetupActor will now finish if there is an unhandled exception.
* ComputeNodeMonitorActor now explains why a node that is in the shutdown window
is not eligible for shutdown.
* Logging in nodes_wanted now distinguishes idle/busy/booting/shutting down.
* Logging by actors is now class name and a portion of the actor urn, so actions
of a specific actor can be consistently identified.
Tom Clegg [Thu, 4 Feb 2016 19:29:39 +0000 (14:29 -0500)]
Recognize another way slurm tells us about node failures.
Retry, instead of giving up, in situations like this:
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Aborting, io error and missing step on node 0
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Timed out waiting for job step to complete
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 child 42984 on compute26.1 exit 0 success=
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 failure (#1, permanent) after 560 seconds
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 task output (0 bytes):
No issue #
Tom Clegg [Thu, 4 Feb 2016 18:17:31 +0000 (13:17 -0500)]
Merge branch '8288-poll-client-close-timeout' refs #8288
Tom Clegg [Mon, 1 Feb 2016 06:58:34 +0000 (01:58 -0500)]
8288: Add timeout option to close() method of event clients.
Previously in EventClient, close() didn't wait for anything. Now, if a
timeout is given, it waits for ws4py to call the closed() callback to
indicate the connection has closed.
Previously in PollClient, close() waited indefinitely for the polling
thread to terminate. This can take a very long time if, for example,
there are multiple subscriptions and the "get logs" API transaction is
slow.
The only apparent reason a caller would want to wait here at all is to
guarantee the simplifying assumption that the on_event() callback is never
called after close(). Now, instead of letting the thread run until
all events are received and handled, PollClient achieves this the same
way EventClient does: ignore events that arrive after close().
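A simplified sketch of the "ignore events that arrive after close()" idea. `SketchPollClient` and its `handle` method are hypothetical and are not the Arvados SDK classes; how the real clients use the timeout is only assumed here.

```python
import threading


class SketchPollClient:
    """Toy client: events delivered after close() are dropped."""

    def __init__(self, on_event):
        self._on_event = on_event
        self._closing = threading.Event()
        self._thread = None  # polling thread, if one was started

    def handle(self, event):
        # Called by the polling loop for each event it receives.
        if self._closing.is_set():
            return False  # arrived after close(): ignore it
        self._on_event(event)
        return True

    def close(self, timeout=None):
        # Mark the client closed first, so close() never has to wait for
        # pending events to be received and handled.
        self._closing.set()
        # Assumption: with a timeout, wait at most that long for the thread.
        if timeout is not None and self._thread is not None:
            self._thread.join(timeout)
```

With this shape, the caller keeps the guarantee that `on_event()` is not invoked after `close()` without paying for a slow polling request to finish.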
radhika [Thu, 4 Feb 2016 15:45:26 +0000 (10:45 -0500)]
8183: set limit on my_toplevel_projects
radhika [Thu, 4 Feb 2016 15:21:00 +0000 (10:21 -0500)]
8183: show only toplevel projects in the Projects dropdown in breadcrumbs.
Brett Smith [Thu, 4 Feb 2016 10:33:24 +0000 (05:33 -0500)]
Make install guide slurm.conf more Arvados-compliant.
* SelectType=select/linear allocates entire nodes at a time. The
previous value scheduled individual cores.
* With that change, SelectTypeParameters=CR_CPU_Memory is not valid.
Remove it, as we do in production.
* The setting of FastSchedule seems less pressing, but 0 is what we
use in production, so share that here too.
No issue #.
Peter Amstutz [Wed, 3 Feb 2016 22:51:46 +0000 (17:51 -0500)]
Try to make logging identify the actor consistently
Peter Amstutz [Wed, 3 Feb 2016 20:54:18 +0000 (15:54 -0500)]
Merge branch '6702-gce-node-create-fix' closes #6702
Tom Clegg [Wed, 3 Feb 2016 17:51:54 +0000 (12:51 -0500)]
Merge branch '8288-arv-mount-deadlock' refs #8288
Tom Clegg [Tue, 2 Feb 2016 21:46:35 +0000 (16:46 -0500)]
8288: Do not call operations.destroy() as a last resort, just abandon the llfuse thread.
Tom Clegg [Mon, 1 Feb 2016 08:01:31 +0000 (03:01 -0500)]
8288: Add test case for --exec mode.
Tom Clegg [Mon, 1 Feb 2016 02:43:30 +0000 (21:43 -0500)]
8288: Give fusermount -u a chance to work before resorting to operations.destroy().
Log a warning when resorting to operations.destroy().
De-duplicate setup/teardown code so more of the --exec code path is exercised in tests.
Tom Clegg [Wed, 3 Feb 2016 17:50:31 +0000 (12:50 -0500)]
8123: Install chartjs.js asset file.
...during "setup.py install" too, not just when installing via
package.
refs #8123