Retry the task, instead of giving up, when srun aborts with an io error, as in this log:
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Aborting, io error and missing step on node 0
2016-02-02_08:42:26 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 stderr srun: error: Timed out waiting for job step to complete
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 child 42984 on compute26.1 exit 0 success=
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 ERROR: Task process exited 0, but never updated its task record to indicate success and record its output.
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 failure (#1, permanent) after 560 seconds
2016-02-02_08:42:28 wx7k5-8i9sb-guk2lv53z3572dc 40682 3 task output (0 bytes):
No issue #
# whoa.
$main::please_freeze = 1;
}
- elsif ($line =~ /srun: error: Node failure on/) {
+ elsif ($line =~ /srun: error: (Node failure on|Aborting, io error)/) {
my $job_slot_index = $jobstep[$job]->{slotindex};
$slot[$job_slot_index]->{node}->{fail_count}++;
$jobstep[$job]->{tempfail} = 1;
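
To show which srun messages the widened pattern catches, here is a minimal standalone sketch. The helper name `is_srun_tempfail` is hypothetical and not part of the script being patched. The "Node failure on" sample line is made up for illustration; the other two are taken from the log above.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the patched script): returns 1 when an srun
# stderr line matches the pattern from the diff, i.e. the cases that are
# treated as temporary failures and retried; returns 0 otherwise.
sub is_srun_tempfail {
    my ($line) = @_;
    return $line =~ /srun: error: (Node failure on|Aborting, io error)/ ? 1 : 0;
}

my @samples = (
    # From the log above: newly matched by the widened pattern.
    "srun: error: Aborting, io error and missing step on node 0",
    # Made-up sample of the message that was already matched before.
    "srun: error: Node failure on compute26",
    # From the log above: still not matched.
    "srun: error: Timed out waiting for job step to complete",
);

for my $line (@samples) {
    printf "%-5s %s\n", (is_srun_tempfail($line) ? "retry" : "other"), $line;
}
```

Only the first alternative is new. The "Timed out waiting for job step to complete" line is deliberately left unmatched, because the "Aborting, io error" line printed just before it is enough to flag the task.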