Satrajit Ghosh
2014-10-24 02:03:34 UTC
hi all,
is there a way to suspend gpu jobs with gang scheduling? or if not, is
there a way to ensure that gpu jobs don't enter suspend state?
here is a simplified description of the problem with a hypothetical cluster
of one node with 2 gpus and gang scheduling enabled.
timeline
- 4 gpu jobs submitted with --gres=gpu:1
- 2 gpu jobs start running
- 30s later gang scheduling kicks in
- suspends the two running jobs
- the next two jobs are then started
- these two new jobs are terminated because no CUDA devices are available.
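
to make the timeline concrete, here is a toy model of it in plain python. this is only a sketch, not slurm code: `Job` and `simulate` are hypothetical names, and the model assumes that a suspended job keeps holding its gpu, which is one possible explanation for the failure rather than something confirmed.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Job:
    """hypothetical stand-in for a job submitted with --gres=gpu:1."""
    name: str
    state: str = "PENDING"  # PENDING, RUNNING, SUSPENDED or FAILED
    gpu: Optional[int] = None


def simulate(num_gpus: int = 2, num_jobs: int = 4) -> List[Tuple[str, str]]:
    """replay the timeline on one node and return (job name, final state)."""
    free_gpus = list(range(num_gpus))
    jobs = [Job(f"job{i}") for i in range(1, num_jobs + 1)]

    # step 1: as many jobs start as there are gpus
    for job in jobs[:num_gpus]:
        job.gpu = free_gpus.pop(0)
        job.state = "RUNNING"

    # step 2: the time slice expires and the running jobs are suspended.
    # ASSUMPTION: a suspended job keeps its gpu, so nothing is returned
    # to free_gpus here.
    for job in jobs:
        if job.state == "RUNNING":
            job.state = "SUSPENDED"

    # step 3: the waiting jobs are started; each one fails if no gpu is free
    for job in jobs:
        if job.state == "PENDING":
            if free_gpus:
                job.gpu = free_gpus.pop(0)
                job.state = "RUNNING"
            else:
                job.state = "FAILED"

    return [(job.name, job.state) for job in jobs]


if __name__ == "__main__":
    for name, state in simulate():
        print(name, state)
```

under that assumption the first two jobs end up suspended and the last two fail, which matches what i see on the cluster.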
cheers,
satra