Jagga Soorma
2014-03-29 19:01:26 UTC
Hi Everyone,
I am switching over from Torque to Slurm on a new cluster with GPU
resources. I have installed the latest stable release, 14.03.0-1.
I have two NVIDIA GPUs on each node:
--
amber203:/etc/slurm # ls -l /dev/nvidia*
crw-rw-rw- 1 root video 195, 0 Mar 29 11:46 /dev/nvidia0
crw-rw-rw- 1 root video 195, 1 Mar 29 11:46 /dev/nvidia1
crw-rw-rw- 1 root video 195, 255 Mar 29 11:46 /dev/nvidiactl
amber203:/etc/slurm # nvidia-smi | grep Tesla
| 0 Tesla K20Xm Off | 0000:08:00.0 Off | 0 |
| 1 Tesla K20Xm Off | 0000:27:00.0 Off | 0 |
--
I have also updated the slurm.conf and gres.conf files across the
cluster with the following:
--
amber203:/etc/slurm # grep -i gpu /etc/slurm/slurm.conf
GresTypes=gpu
NodeName=amber[201-240] CPUs=20 RealMemory=32074 Sockets=2 CoresPerSocket=10 Gres=gpu:2 State=UNKNOWN
PartitionName=ambergpuprod Nodes=amber[201-240] Default=YES MaxTime=INFINITE State=UP
amber203:/etc/slurm # cat /etc/slurm/gres.conf
NodeName=amber[201-240] Name=gpu File=/dev/nvidia[0-1]
--
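In case it is relevant, this is how I was planning to double-check that
the controller actually picked up the Gres settings (I'm assuming
scontrol and sinfo report gres the way their man pages describe;
amber203 is just the node I happen to be testing on):
--
scontrol show node amber203 | grep -i gres   # what the controller has configured for this node
scontrol show config | grep -i grestypes     # confirm GresTypes made it into the running config
sinfo -N -o "%N %G"                          # per-node gres summary
--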
However, after restarting all Slurm services I am still getting the
following "gres/gpu count too low" message when running sinfo:
--
amber203:/etc/slurm # sinfo -lNe
Sat Mar 29 11:57:40 2014
NODELIST                           NODES    PARTITION    STATE CPUS  S:C:T MEMORY TMP_DISK WEIGHT FEATURES REASON
amber201                               1 ambergpuprod*     idle   20 2:10:1  32074        0      1   (null) none
amber[202,210,222,224-226,228-240]    19 ambergpuprod*    down*   20 2:10:1  32074        0      1   (null) Not responding
amber203                               1 ambergpuprod* drained*   20 2:10:1  32074        0      1   (null) gres/gpu count too l
amber[204-209,211-221,223,227]        19 ambergpuprod*  drained   20 2:10:1  32074        0      1   (null) gres/gpu count too l
--
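I am also assuming that even once the gres mismatch itself is sorted
out I will still need to clear the drained state by hand, with
something like the following (the node list here is just an example):
--
scontrol update NodeName=amber[204-209,211-221,223,227] State=RESUME
--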
What am I missing here, and how can I get more information about why
sinfo is reporting that the gpu count is too low? I have also tried the
following format in the gres.conf file without any luck (the debugging
I was planning to try next is sketched after the snippet):
--
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
--
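The next thing I was going to try is running slurmd in the foreground
with extra verbosity on one of the drained nodes, to see what the
gres/gpu plugin actually detects. This assumes the -D and -v flags
behave as described in the slurmd man page, and that the log path
matches whatever SlurmdLogFile points at:
--
/etc/init.d/slurm stop                  # or however slurmd is normally stopped on these nodes
slurmd -D -vvvv                         # run slurmd in the foreground with verbose logging
grep -i gres /var/log/slurm/slurmd.log  # actual location depends on SlurmdLogFile
--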
Any help would be greatly appreciated!
Thanks,
-J