Discussion: Gres GPU Problem with new slurm cluster
Jagga Soorma
2014-03-29 19:01:26 UTC
Hi Everyone,

I am switching over from Torque to Slurm on a new cluster with GPU
resources. I have installed the latest stable release, 14.03.0-1. I
have two NVIDIA GPUs on each node:

--
amber203:/etc/slurm # ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195, 0 Mar 29 11:46 /dev/nvidia0
crw-rw-rw- 1 root video 195, 1 Mar 29 11:46 /dev/nvidia1
crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl

amber203:/etc/slurm # nvidia-smi | grep Tesla
| 0 Tesla K20Xm Off | 0000:08:00.0 Off | 0 |
| 1 Tesla K20Xm Off | 0000:27:00.0 Off | 0 |
--

I have also updated the slurm.conf and gres.conf files across the
cluster with the following:

--
amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP

amber203:/etc/slurm # cat /etc/slurm/gres.conf
NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
--

However, after restarting all Slurm services I am still getting the
following "gres/gpu count too low" message when running sinfo:

--

amber203:/etc/slurm # sinfo -lNe
Sat Mar 29 11:57:40 2014
NODELIST                            NODES    PARTITION     STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
amber201                                1 ambergpuprod*     idle   20 2:10:1  32074        0      1   (null) none
amber[202,210,222,224-226,228-240]     19 ambergpuprod*    down*   20 2:10:1  32074        0      1   (null) Not responding
amber203                                1 ambergpuprod* drained*   20 2:10:1  32074        0      1   (null) gres/gpu count too l
amber[204-209,211-221,223,227]         19 ambergpuprod*  drained   20 2:10:1  32074        0      1   (null) gres/gpu count too l
--

What am I missing here, and how can I get more information about why
sinfo is reporting that the gpu count is too low? I have also tried
the following format in the gres.conf file, without any luck:

--
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
--
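
In case it helps, here is where I plan to look next for more detail
(the log path below is only a guess, since it depends on what
SlurmdLogFile is set to; DebugFlags=Gres is taken from the slurm.conf
man page):

--
# What the controller currently believes about this node's GRES
# (node name taken from the examples above):
scontrol show node amber203 | grep -i Gres

# The slurmd log on the node should record any gres.conf problems
# seen at registration time; the path is an assumption, use whatever
# SlurmdLogFile points to on your install:
grep -i gres /var/log/slurm/slurmd.log

# For more verbose GRES logging, set DebugFlags=Gres in slurm.conf
# and then push the change out:
scontrol reconfigure
--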

Any help would be greatly appreciated!

Thanks,
-J
Jagga Soorma
2014-03-29 20:11:25 UTC
Okay, so it looks like I just had to clear the nodes manually using
scontrol after updating gres.conf on each node. Isn't there a way to
have Slurm do this automatically, without manual intervention via
scontrol or sview?
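
For the record, the manual clearing was something along these lines
(the node range is just the one from my config; RESUME is the scontrol
node state that returns drained or down nodes to service):

--
# Clear the drain/down state once gres.conf is fixed on the nodes
# (sview can do the same thing from its node view):
scontrol update NodeName=amber[201-240] State=RESUME
--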

Thanks,
-J
David Bigagli
2014-03-31 16:46:28 UTC
You may want to look at the ReturnToService parameter in slurm.conf:

http://slurm.schedmd.com/slurm.conf.html
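
For example, a minimal form of that setting (only a sketch; check the
man page above for the exact semantics of each value in your release):

--
# In slurm.conf, then restart or reconfigure the daemons.
# The default (0) keeps a DOWN node down until an admin clears it;
# 1 lets a node that went down for not responding return to service
# once it registers with a valid configuration; value 2, if your
# release supports it, also covers nodes set down for other reasons.
ReturnToService=1
--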
Post by Jagga Soorma
Okay, so it looks like I just had to clear the nodes manually using
scontrol after updating gres.conf on each node. Isn't there a way to
have Slurm do this automatically, without manual intervention via
scontrol or sview?
Thanks,
-J
--
Thanks,
/David/Bigagli

www.schedmd.com