4410: Crunch retries jobs when all SLURM nodes fail.
See the ticket for detailed background discussion and implementation
rationale, especially notes 13 and 14.
This required a couple of ancillary changes:
* crunch-job now distinguishes between "task failed because a
node failed" and "task failed for another temporary reason." It uses
this additional information to decide when it should retry tasks
itself and when it needs to give up and kick the problem up to
crunch-dispatch.
* crunch-job now handles creating log collections itself from
manifests generated by arv-put. This enables it to append to logs
generated during previous attempts to run the job.
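
For illustration only, the retry decision described in the first bullet
might be sketched as below. This is not the crunch-job code: the names
Failure, Action and decide, the usable_nodes count, and the per-task
retry budget are all hypothetical, chosen to show how separating the
two failure kinds lets the caller pick between retrying locally and
escalating. The real rules are in the ticket.

```python
from enum import Enum


class Failure(Enum):
    """Why a task failed (hypothetical labels, mirroring the two cases above)."""
    NODE = "task failed because a node failed"
    TEMPORARY = "task failed for another temporary reason"


class Action(Enum):
    """What the job runner does next (hypothetical labels)."""
    RETRY_TASK = "retry the task within this job attempt"
    ESCALATE = "give up and hand the job back to the dispatcher"
    FAIL_TASK = "stop retrying this task"


def decide(kind, usable_nodes, retries_left):
    """Pick the next action after a task failure.

    Assumption: the job is escalated only when a node failure leaves
    no usable nodes; otherwise the task is retried locally while its
    (assumed) retry budget lasts.
    """
    if kind is Failure.NODE and usable_nodes == 0:
        # Every node is gone, so retrying here cannot succeed.
        return Action.ESCALATE
    if retries_left > 0:
        return Action.RETRY_TASK
    return Action.FAIL_TASK
```

With these assumptions, decide(Failure.NODE, 0, 3) escalates even though
retries remain, while decide(Failure.NODE, 2, 3) retries the task on a
surviving node.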