git.arvados.org - arvados.git/commit

author	Brett Smith <brett@curoverse.com>
	Mon, 15 Jun 2015 17:54:36 +0000 (13:54 -0400)
committer	Brett Smith <brett@curoverse.com>
	Mon, 22 Jun 2015 20:49:52 +0000 (16:49 -0400)
commit	b269c28f1d54e8609f36c8aeb77a2b6025172066
tree	a33687bcec1a55add8be72c8b8c118e8664ccf04
parent	24b4d1ad90558332cd5251b265a54c21ffdbfd36

4410: Crunch retries jobs when all SLURM nodes fail.

See the ticket for detailed background discussion and implementation
rationale, especially notes 13 and 14.

This required a couple of ancillary changes:

* crunch-job now makes a distinction between "task failed because a
  node failed" and "task failed for another temporary reason."  It uses
  this additional information to decide when it should retry tasks
  itself, and when it needs to give up and kick the problem up to
  crunch-dispatch.

* crunch-job now handles creating log collections itself from
  manifests generated by arv-put.  This enables it to append to logs
  generated during previous attempts to run the job.
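
The retry decision described in the first item can be sketched roughly as
follows.  This is an illustration only, not the crunch-job code (which is
not written in Python): the names TaskFailure, Action, and decide, the
working_nodes count, and the max_attempts limit are all hypothetical and
do not come from this commit.

```python
from enum import Enum


class TaskFailure(Enum):
    """Hypothetical failure classes; only the first two come from the message."""
    NODE_FAILURE = "node_failure"        # task failed because a node failed
    OTHER_TEMPORARY = "other_temporary"  # task failed for another temporary reason
    PERMANENT = "permanent"              # assumed third class for non-retryable errors


class Action(Enum):
    """Hypothetical outcomes of the retry decision."""
    RETRY_TASK = "retry_task"                      # crunch-job retries the task itself
    ESCALATE_TO_DISPATCH = "escalate_to_dispatch"  # hand the job back to crunch-dispatch
    FAIL_JOB = "fail_job"


def decide(failure, working_nodes, attempts, max_attempts=3):
    """Pick what to do after a task failure.

    working_nodes: nodes still usable by this job (assumed bookkeeping).
    attempts: how many times this task has been tried so far.
    max_attempts: assumed retry limit, not a value taken from the commit.
    """
    if failure is TaskFailure.PERMANENT:
        return Action.FAIL_JOB
    # When a node failure leaves no usable nodes, the job runner cannot
    # make progress on its own, so the problem goes up to the dispatcher.
    if failure is TaskFailure.NODE_FAILURE and working_nodes == 0:
        return Action.ESCALATE_TO_DISPATCH
    if attempts < max_attempts:
        return Action.RETRY_TASK
    return Action.FAIL_JOB
```

For example, decide(TaskFailure.NODE_FAILURE, working_nodes=0, attempts=1)
returns Action.ESCALATE_TO_DISPATCH, while the same failure with
working_nodes=2 returns Action.RETRY_TASK.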

sdk/cli/bin/crunch-job
services/api/script/crunch-dispatch.rb