radhika [Sat, 4 Jun 2016 14:06:09 +0000 (10:06 -0400)]
8876: introduce view helper methods such as link_to_log and queuedtime so that the views do not have to make too many decisions based on the state of the work unit.
Tom Clegg [Fri, 27 May 2016 01:17:31 +0000 (21:17 -0400)]
9272: Fix up state transitions:
* Change state to Running only at the last possible moment before
starting the container.
* When erroring out before Running, change state back to Queued.
* Do not save log/output/exit code when changing state to Cancelled.
Incidental fixes:
* Clean up error handling in Run()
* Don't create a collection for, or try to attach the container to,
the second "cleanup activities" log that gets opened after closing
the real container log.
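
As a rough illustration of the transition rules in 9272, here is a minimal Python sketch. It is not the code from this commit: ContainerRecord, run(), the "Locked" starting state, and the placeholder log/output values are all hypothetical names and assumptions chosen for the example.

```python
class ContainerRecord:
    """Hypothetical stand-in for the container record; not the real API."""

    def __init__(self, state="Locked"):  # starting state is an assumption
        self.state = state
        self.history = [state]
        self.saved = {}  # log/output/exit_code persisted with a state change

    def update(self, state, **fields):
        self.state = state
        self.history.append(state)
        self.saved.update(fields)


def run(record, prepare, start_and_wait, is_cancelled=lambda: False):
    """Drive one container through the transitions listed above."""
    try:
        prepare()  # image loading, mounts, and other setup
    except Exception:
        # Failed before Running: put the container back in the queue.
        record.update("Queued")
        raise
    if is_cancelled():
        # Cancelled: change state only; save no log, output, or exit code.
        record.update("Cancelled")
        return
    # Flip to Running at the last possible moment before starting.
    record.update("Running")
    exit_code = start_and_wait()
    record.update("Complete", exit_code=exit_code,
                  log="<log collection>", output="<output collection>")
```

With a prepare() that raises, the recorded history is Locked then Queued; with is_cancelled() returning true, it is Locked then Cancelled and nothing is saved.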
Peter Amstutz [Thu, 2 Jun 2016 21:59:20 +0000 (17:59 -0400)]
9187: Improve squeue synchronization
* Put squeue functions into separate file.
* CheckSqueue() now blocks on a condition variable until the next successful
update of squeue, which then wakes up all goroutines waiting on CheckSqueue().
* Never do anything when squeue returns an error.
* Merge submitting, monitoring, and cleanup behaviors into a single goroutine
which updates based on CheckSqueue() instead of a ticker.
* Introduce a lock on squeue, sbatch and scancel operations, so that on next
wakeup the queue is guaranteed to reflect most recent sbatch/scancel
operations.
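
The synchronization scheme in this commit can be sketched roughly as follows. The dispatcher itself uses goroutines; this is a Python analogue using threading.Condition, and SqueueCache, refresh(), check(), with_lock(), and the timeout parameter are hypothetical names introduced only for the example.

```python
import threading


class SqueueCache:
    """Illustrative only: waiters block until the next successful poll."""

    def __init__(self, list_jobs):
        self._list_jobs = list_jobs  # callable; raises if squeue fails
        self._cond = threading.Condition()
        self._jobs = frozenset()
        self._generation = 0  # counts successful polls

    def refresh(self):
        """Poll once. On error, keep the old contents and wake nobody."""
        with self._cond:  # same lock that guards sbatch/scancel below
            try:
                jobs = frozenset(self._list_jobs())
            except Exception:
                return False
            self._jobs = jobs
            self._generation += 1
            self._cond.notify_all()  # wake every waiter in check()
            return True

    def check(self, name, timeout=None):
        """Block until the next successful refresh, then test membership."""
        with self._cond:
            seen = self._generation
            if not self._cond.wait_for(
                    lambda: self._generation != seen, timeout):
                raise TimeoutError("no successful squeue update")
            return name in self._jobs

    def with_lock(self, operation):
        """Run an sbatch/scancel-style callable under the squeue lock."""
        with self._cond:
            return operation()
```

Because refresh() and with_lock() share one lock, a poll cannot overlap a submit or cancel, so the next wakeup reflects any operation that completed before the poll started.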
Peter Amstutz [Wed, 1 Jun 2016 20:06:26 +0000 (16:06 -0400)]
9187: Slurm dispatcher improvements around squeue
* Clarify that status updates are not guaranteed to be delivered on a
heartbeat.
* Refactor slurm dispatcher to monitor the container in squeue in a separate
goroutine.
* Refactor polling squeue to a single goroutine and cache the results so that
monitoring 100 containers doesn't result in 100 calls to squeue.
* No longer set up strigger to cancel job on finish; instead, cancel running
jobs not in squeue.
* Test both cases where a job is/is not in squeue.
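
To illustrate the caching idea above, here is a small Python sketch under stated assumptions: one poll produces a snapshot that any number of monitors can consult, and running containers missing from the snapshot are selected for cancellation. CachedSqueue and containers_to_cancel are hypothetical names, not identifiers from the dispatcher.

```python
class CachedSqueue:
    """Illustrative cache: one squeue call serves many lookups."""

    def __init__(self, run_squeue):
        self._run_squeue = run_squeue  # callable returning job names
        self._snapshot = frozenset()

    def poll(self):
        # Called from a single polling loop, never from the monitors.
        self._snapshot = frozenset(self._run_squeue())

    def contains(self, uuid):
        # Monitors read the cached snapshot instead of running squeue.
        return uuid in self._snapshot


def containers_to_cancel(running_uuids, cache):
    """Running containers that no longer appear in squeue."""
    return sorted(u for u in running_uuids if not cache.contains(u))
```

Monitoring 100 containers then costs 100 set lookups against one snapshot rather than 100 squeue invocations.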
Brett Smith [Tue, 31 May 2016 20:35:53 +0000 (16:35 -0400)]
9242: Update Python module paths for CentOS 6.
I am more sure that this is correct, based on multiple data points
from Python 2 and 3 packages across CentOS 6 and 7.
This might be a change that's fallout from 44ceaa474a330f12dd9e00115af107d7258044f2.
Refs #9242.
Brett Smith [Fri, 27 May 2016 22:30:57 +0000 (18:30 -0400)]
8959: Remove redundant python-gflags fpm-info.sh.
I added this file in 495a485ff. Later, Nico pinned the version in
run-build-packages, in a8bbf6ef, to try to fix #8959. However, odds
are that #8959 was an ops problem, and not a package building problem:
the gflags 3.0 packages were still published on our repository, and
needed to be removed there.
Having both files causes trouble when you're building backports from
scratch. We haven't noticed because Jenkins never does that. But
I'm working on new packages and getting:
Loading fpm overrides from /arvados/backports/python-gflags/fpm-info.sh
Peter Amstutz [Thu, 19 May 2016 18:12:42 +0000 (14:12 -0400)]
9187: Refactor dispatcher support into common library and update to use Locking API.
New dispatcher package in Go SDK provides framework for monitoring list of
queued/locked/running containers. Try to lock containers in the queue; locked
or running containers are passed to RunContainer goroutine supplied by the
specific dispatcher. Refactor existing dispatchers (-local and -slurm) to use
this framework. Dispatchers have crash recovery behavior and can put containers
that are unaccounted for into the Cancelled state.
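
The framework's main loop, as described above, might look something like the following Python sketch. The real package is written in Go and its API is not shown in this log; dispatch_once, try_lock, run_container, and the tracked set are hypothetical names, and the callback is invoked synchronously here where the real framework starts a goroutine.

```python
def dispatch_once(containers, try_lock, run_container, tracked):
    """One pass over the container list (illustrative sketch only).

    containers:    iterable of dicts with "uuid" and "state" keys
    try_lock:      callable(container) -> bool, True if the lock succeeded
    run_container: callable(container), supplied by the specific dispatcher
    tracked:       set of uuids already handed to run_container
    """
    for container in containers:
        if container["uuid"] in tracked:
            continue  # already being handled
        if container["state"] == "Queued":
            if not try_lock(container):
                continue  # another dispatcher may have won the lock
            container = dict(container, state="Locked")
        if container["state"] in ("Locked", "Running"):
            # Locked or Running containers go to the dispatcher's handler;
            # picking up Running ones is what allows recovery after a crash.
            tracked.add(container["uuid"])
            run_container(container)
```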
Peter Amstutz [Thu, 26 May 2016 13:51:24 +0000 (09:51 -0400)]
9303: Fetch arv_node before trying to shut down node, because monitor actor may
go away once the node has been successfully shut down. Also handle the case where
node_finished_shutdown is called after the shutdown actor is stopped.
Peter Amstutz [Wed, 18 May 2016 20:33:57 +0000 (16:33 -0400)]
8653: cwl-runner crunch script rewrites keep file paths into CWL File objects.
Clean up argument handling in arvados-cwl-runner so that --create-template
doesn't require a job object, and so that --help doesn't present options that
are irrelevant or don't work.
Peter Amstutz [Tue, 17 May 2016 20:59:20 +0000 (16:59 -0400)]
8236: Restore os.killpg(). Create a new process group so that it won't kill
the parent process by accident. Watchdog process now only monitors specific
actors.
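
The reason a separate process group makes os.killpg() safe can be shown with a short, POSIX-only Python sketch. It does not reproduce the code from this commit: spawn_in_new_group and kill_group are hypothetical helpers, and using subprocess with start_new_session=True is just one way to obtain a new process group.

```python
import os
import signal
import subprocess
import sys


def spawn_in_new_group(argv):
    """Start a child that leads its own process group (POSIX only)."""
    # start_new_session=True makes the child call setsid(), which also
    # places it in a new process group separate from the parent's.
    return subprocess.Popen(argv, start_new_session=True)


def kill_group(proc, sig=signal.SIGTERM):
    """Signal every process in the child's group, but not the parent."""
    os.killpg(os.getpgid(proc.pid), sig)


if __name__ == "__main__":
    child = spawn_in_new_group(
        [sys.executable, "-c", "import time; time.sleep(60)"])
    # The group IDs differ, so killpg cannot reach this process.
    assert os.getpgid(child.pid) != os.getpgid(0)
    kill_group(child, signal.SIGKILL)
    child.wait()
```

Without the new group, the child would share the parent's group ID and os.killpg() on that ID would signal the parent as well.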