Discussion:
Job arrays and exit code status
Yuri D'Elia
2014-07-21 09:59:33 UTC
Hi everyone.

I don't see much about how individual jobs in an array are tracked after
completion. It also looks like individual job indexes are not stored in
'sacct'/accounting.

I kind of expected that the exit code of the main job in an array would
be a logical combination of the exit codes of all the individual indexes.
That is, if any of the indexes returned non-zero, the main exit code would
also be non-zero.
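
To make that expectation concrete, here is a minimal sketch in Python. It is
not Slurm code and does not query Slurm; both function names are hypothetical,
and the list of per-index exit codes is assumed to be available somehow. It
only contrasts the "logical OR" behaviour I expected with taking the exit
status of the first index.

```python
from typing import Sequence


def aggregate_exit_code(index_exit_codes: Sequence[int]) -> int:
    """Expected behaviour: 0 only if every index exited with 0.

    Returns the first non-zero exit code found, so the array as a whole
    counts as failed as soon as any single index failed.
    """
    for code in index_exit_codes:
        if code != 0:
            return code
    return 0


def first_index_exit_code(index_exit_codes: Sequence[int]) -> int:
    """Alternative behaviour: only the first index decides the result."""
    return index_exit_codes[0]


# Hypothetical array of four indexes where the third one fails.
codes = [0, 0, 2, 0]
expected = aggregate_exit_code(codes)
first_only = first_index_exit_code(codes)
print("aggregate:", expected, "first index only:", first_only)
```

With the aggregated code, an afterok dependency would not fire for this array,
while the first-index rule would report success even though one index failed.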

But after some random testing (Slurm 14.03.4), it looks like the main
exit status is just the exit status of the first index.

As such, using any dependency based on afterok/afternotok is kind of
pointless. And since there's no accounting for each index, I'm at a loss here.

Any comments?

And since we're discussing this, it would also make sense to have a
policy for job array failures. A failure for a single index could:

- flag job as FAILED, but still continue executing the remaining indexes
- flag job as COMPLETED as long as at least one index was ok
- flag job as FAILED, but cancel the job as well

For an array, I would guess the last mode makes more sense for a default.
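
The three modes above can be sketched as a small Python simulation. This is
only an illustration of the proposal, not an existing Slurm option: the
`ArrayFailurePolicy` enum, the `run_array` function and the policy names are
all hypothetical, and the indexes are assumed to run one after another so that
"cancel" has a well-defined meaning.

```python
from enum import Enum
from typing import Sequence, Tuple


class ArrayFailurePolicy(Enum):
    """Hypothetical policy names for the three modes proposed above."""
    FAIL_CONTINUE = "fail_continue"            # FAILED, remaining indexes still run
    COMPLETE_IF_ANY_OK = "complete_if_any_ok"  # COMPLETED if at least one index was ok
    FAIL_CANCEL = "fail_cancel"                # FAILED, remaining indexes cancelled


def run_array(exit_codes: Sequence[int],
              policy: ArrayFailurePolicy) -> Tuple[str, int]:
    """Simulate an array whose indexes return the given exit codes.

    Returns (final_state, indexes_executed). Indexes are processed
    sequentially, which is a simplifying assumption of this sketch.
    """
    executed = 0
    any_ok = False
    any_failed = False
    for code in exit_codes:
        executed += 1
        if code == 0:
            any_ok = True
        else:
            any_failed = True
            if policy is ArrayFailurePolicy.FAIL_CANCEL:
                # Stop at the first failure; later indexes never run.
                return ("FAILED", executed)
    if policy is ArrayFailurePolicy.COMPLETE_IF_ANY_OK:
        return ("COMPLETED" if any_ok else "FAILED", executed)
    return ("FAILED" if any_failed else "COMPLETED", executed)


# Hypothetical array of three indexes where the second one fails.
for policy in ArrayFailurePolicy:
    print(policy.value, run_array([0, 3, 0], policy))
```

In this example only the cancel mode avoids running the third index, which is
why it seems like the more sensible default for an array.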