J***@public.gmane.org
2014-07-07 16:59:34 UTC
We are using slurm 2.6.4 as a mechanism to load-balance software builds from login/access servers across a pool of build servers provisioned for this task. We have been using it this way for several months for our automated builds with no problems, and started using it for user builds last week. Unfortunately, this has not gone quite as smoothly.
Our "make" wrapper script simply runs "srun -p build make ...". Most of the time this works, but in some cases slurm still reports the job as running even though the slurmstepd on the build node that spawned the make has exited. Eventually these stale jobs occupy all the available nodes in the build partition, and new jobs queue until the completed jobs are manually canceled.
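To make the symptom easier to spot, here is a rough sketch of a helper that lists jobs slurm reports as RUNNING in the build partition for longer than some cutoff. It is only an illustration, not something we run today: the function names and the cutoff are hypothetical, and it assumes squeue's "-h -p build -t RUNNING -o '%i %M %N'" output (job id, elapsed time as [days-][hours:]minutes:seconds, node list). A long elapsed time does not prove the job is stale, so each candidate would still need to be checked on the build node.

```python
import subprocess


def parse_elapsed(text):
    """Convert squeue's [days-][hours:]minutes:seconds string to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad missing leading fields (hours, minutes) with zeros.
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def long_running(squeue_output, threshold_seconds):
    """Return (job_id, elapsed, nodes) for jobs at or above the cutoff."""
    jobs = []
    for line in squeue_output.splitlines():
        fields = line.split()
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        job_id, elapsed, nodes = fields[0], fields[1], fields[2]
        try:
            seconds = parse_elapsed(elapsed)
        except ValueError:
            continue  # skip elapsed values that are not in the assumed format
        if seconds >= threshold_seconds:
            jobs.append((job_id, elapsed, nodes))
    return jobs


def query_partition(partition="build"):
    """Run squeue for one partition and return its raw text output."""
    cmd = ["squeue", "-h", "-p", partition, "-t", "RUNNING", "-o", "%i %M %N"]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return result.stdout
```

Calling long_running(query_partition(), 2 * 3600) would list jobs that have been running for two hours or more; the two-hour cutoff is arbitrary and would need to sit comfortably above the longest legitimate build.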
Has anyone encountered this before? Any suggestions of where I should investigate further would be greatly appreciated.
--jtc