gpu resource sharing (or not sharing) with gang scheduling

Satrajit Ghosh

2014-10-24 02:03:34 UTC

hi all,

is there a way to suspend gpu jobs with gang scheduling? or if not, is
there a way to ensure that gpu jobs don't enter suspend state?

here is a simplified description of the problem, using a hypothetical
cluster of one node with 2 gpus and gang scheduling enabled.

timeline
- 4 gpu jobs submitted with --gres=gpu:1
- 2 gpu jobs start running
- 30s later gang scheduling kicks in
- suspends the two running jobs
- the next two jobs are then started
- these two new jobs are terminated because the CUDA devices are not available.
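
to make the timeline concrete, here is a toy python model of it. this is only a sketch, not the scheduler's actual logic: `Job`, `simulate` and the state names are made up for illustration, and it assumes (this is a guess at the cause, not something confirmed) that a suspended job keeps holding its gpu, so jobs started in the next timeslice find no free device.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Job:
    name: str
    state: str = "PENDING"  # PENDING, RUNNING, SUSPENDED or FAILED
    gpu: Optional[int] = None


def simulate(num_gpus: int = 2, num_jobs: int = 4) -> List[Job]:
    """Replay the timeline: one node, num_gpus gpus, one gpu per job."""
    jobs = [Job(f"job{i}") for i in range(1, num_jobs + 1)]
    holders: Dict[int, str] = {}  # gpu index -> name of the job holding it

    # t = 0s: jobs start while free gpus remain; the rest stay pending
    for job in jobs:
        free = [g for g in range(num_gpus) if g not in holders]
        if not free:
            break
        job.gpu, job.state = free[0], "RUNNING"
        holders[job.gpu] = job.name

    # t = 30s: the timeslice expires and the running jobs are suspended.
    # ASSUMPTION: a suspended job still holds its gpu, so `holders`
    # is deliberately left untouched here.
    for job in jobs:
        if job.state == "RUNNING":
            job.state = "SUSPENDED"

    # the pending jobs get the next timeslice, but need a free gpu to survive
    for job in jobs:
        if job.state != "PENDING":
            continue
        free = [g for g in range(num_gpus) if g not in holders]
        if free:
            job.gpu, job.state = free[0], "RUNNING"
            holders[job.gpu] = job.name
        else:
            job.state = "FAILED"  # no device available to the new job

    return jobs


for job in simulate():
    print(job.name, job.state)
# job1 SUSPENDED
# job2 SUSPENDED
# job3 FAILED
# job4 FAILED
```

under that assumption the second pair of jobs can never succeed, which is why either releasing the gpus on suspend or keeping gpu jobs out of the suspend state would be needed.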

cheers,

satra