8284: Fix confusion between %proc and %jobstep.
author Tom Clegg <tom@curoverse.com>
Mon, 25 Jan 2016 00:48:06 +0000 (19:48 -0500)
committer Tom Clegg <tom@curoverse.com>
Mon, 25 Jan 2016 00:48:06 +0000 (19:48 -0500)
commit 3a8714e6fcf41c46d1fde0a6a3e4beb1367d181d
tree e7cd3b2c5fba8eecfa5f410866ad4486805210f8
parent 13ca6c961ce700e84bfa4ace9ea715ce9610b7e5

$proc{$pid}->{jobstep} is an index into @jobstep
$proc{$pid}->{jobstepname} is the name we told srun to use
$proc{$pid}->{killtime} is a deadline when we should kill the process
$jobstep[$jobstepid]->{stderr_at} is the time of last stderr received
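
The layout described above can be sketched as follows. This is only an illustration, not code from crunch-job: the key names come from this commit message, while the pid, timestamps, step name, and the helper stderr_at_for_pid() are made up for the example.

```perl
use strict;
use warnings;

# Hypothetical sample data; only the key names are taken from the commit message.
my @jobstep = (
    { stderr_at => 1453682880 },    # time of last stderr received for step 0
);

my %proc = (
    12345 => {
        jobstep     => 0,            # index into @jobstep
        jobstepname => "example.0",  # name we told srun to use (made-up value)
        killtime    => undef,        # deadline when the process should be killed
    },
);

# Hypothetical helper: find the last-stderr time for a pid by going
# through the jobstep index, since %proc itself has no stderr_at key.
sub stderr_at_for_pid {
    my ($pid) = @_;
    my $jobstepid = $proc{$pid}->{jobstep};
    return $jobstep[$jobstepid]->{stderr_at};
}
```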

We were mistakenly using $proc{$pid}->{stderr_at}, which was always
undef and therefore always compared as less than $last_squeue_check.
This resulted in jobs being killed as "slurm orphans" when the real
reason they hadn't been returned by waitpid() was that we hadn't
finished consuming their stderr yet.
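
A simplified sketch of the comparison, assuming the orphan check reduces to "last stderr is older than the last squeue check" (the real check in crunch-job involves more than this). The functions looks_orphaned_buggy() and looks_orphaned_fixed(), the pid, and the timestamps are hypothetical.

```perl
use strict;
use warnings;

# Made-up timestamps: stderr arrived after the last squeue check.
my $last_squeue_check = 1453682900;
my @jobstep = ( { stderr_at => 1453682950 } );
my %proc    = ( 12345 => { jobstep => 0 } );

# Buggy lookup: %proc entries have no stderr_at key, so the value is undef.
# In a numeric comparison undef counts as 0; "|| 0" makes that explicit
# and avoids an "uninitialized value" warning.
sub looks_orphaned_buggy {
    my ($pid) = @_;
    my $stderr_at = $proc{$pid}->{stderr_at} || 0;
    return $stderr_at < $last_squeue_check ? 1 : 0;
}

# Fixed lookup: follow the jobstep index into @jobstep.
sub looks_orphaned_fixed {
    my ($pid) = @_;
    my $jobstepid = $proc{$pid}->{jobstep};
    my $stderr_at = $jobstep[$jobstepid]->{stderr_at} || 0;
    return $stderr_at < $last_squeue_check ? 1 : 0;
}

printf "buggy=%d fixed=%d\n",
    looks_orphaned_buggy(12345), looks_orphaned_fixed(12345);
```

With these values the buggy lookup flags the process as an orphan even though stderr was received after the last squeue check, while the fixed lookup does not.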
sdk/cli/bin/crunch-job