Configure SLURM to use GPUs

Discussion:

Krishna Teja

2014-08-08 14:28:34 UTC

I have been trying to configure SLURM to use the GPUs available on some of
the nodes in our cluster (compute-0-4, compute-0-5 and compute-0-6, to be
precise). I have followed the instructions given on the SLURM website:

http://slurm.schedmd.com/gres.html

But that doesn't seem to work. I still get the same error as if the GPUs
weren't configured:

srun: error: Unable to allocate resources: Requested node configuration is
not available

Furthermore, I run a simple command to test whether everything is fine with
SLURM, printing the hostnames of all the nodes using

srun -N7 -l /bin/hostname

and I get the following output:

srun: error: Duplicated NodeHostName compute-0-4 in the config file
srun: error: Duplicated NodeHostName compute-0-5 in the config file
srun: error: Duplicated NodeHostName compute-0-6 in the config file
4: compute-0-4.local
5: compute-0-5.local
6: compute-0-6.local
3: compute-0-3.local
1: compute-0-1.local
0: compute-0-0.local
2: compute-0-2.local

I have attached the slurm.conf and gres.conf files. Can someone please
point out what I am doing wrong? Any help appreciated!

Seren Soner

2014-08-08 14:36:33 UTC

You probably have redefined compute-0-[4-6] in /etc/slurm/nodenames.conf.
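
One way to check for this, as a minimal sketch: list every NodeName line across the SLURM config files and look for hosts that are covered more than once. The file contents below are made up for illustration (the attached files are not reproduced in this thread), the CPUs and Gres values are placeholders, and the idea that slurm.conf pulls in nodenames.conf through an Include line is an assumption. On a real cluster, point the function at /etc/slurm instead of the scratch directory.

```shell
#!/bin/sh
# Hypothetical sample configs; stand-ins for the real files in /etc/slurm.
conf_dir=$(mktemp -d)
cat > "$conf_dir/slurm.conf" <<'EOF'
Include /etc/slurm/nodenames.conf
NodeName=compute-0-[4-6] Gres=gpu:2
EOF
cat > "$conf_dir/nodenames.conf" <<'EOF'
NodeName=compute-0-[0-6] CPUs=8
EOF

# Print every node definition with its file name and line number.
# Overlapping ranges such as [0-6] and [4-6] define the same hosts twice.
list_node_defs() {
    grep -n '^NodeName=' "$1"/*.conf
}

list_node_defs "$conf_dir"
```

If two lines cover the same hosts, keeping a single definition per host (for example, adding the Gres= setting to the existing line rather than declaring the GPU nodes again) should make the "Duplicated NodeHostName" errors go away.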

--
Seren Soner

Krishna Teja

2014-08-08 15:08:34 UTC

OK, I've taken care of that part: no more "Duplicated NodeHostName" errors,
but the nodes still aren't configured for GPUs. Am I missing some step that
needs to be done after editing slurm.conf and creating gres.conf?

Michael Robbert

2014-08-08 18:22:52 UTC

Have you restarted slurmd on the nodes? I'm not sure it is needed, but restarting slurmctld on the master would be a good idea as well. A good check is to look at the output of "scontrol show node compute-0-4"; it should have a Gres= entry. If all else fails, look at slurmd.log on the compute nodes, and maybe turn up the debugging if that doesn't show enough info.
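
A rough sketch of that check, for anyone who wants to script it. The helper name has_gpu_gres is made up here, and the sample line fed to it is illustrative input, not output captured from the cluster in this thread. It only looks for a Gres= field that starts with gpu; it says nothing about whether the count is right.

```shell
#!/bin/sh
# Reads `scontrol show node` output on stdin and reports whether a GPU
# Gres entry is present. Returns 0 if found, 1 otherwise.
has_gpu_gres() {
    if grep -q 'Gres=gpu'; then
        echo "gres configured"
    else
        echo "no gpu gres found"
        return 1
    fi
}

# Illustrative input only, not captured from a real node.
printf 'NodeName=compute-0-4 Gres=gpu:2 State=IDLE\n' | has_gpu_gres

# On the cluster itself (requires SLURM to be installed):
#   scontrol show node compute-0-4 | has_gpu_gres
```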

Mike Robbert

Post by Krishna Teja
OK, I've taken care of that part: no more "Duplicated NodeHostName" errors, but the nodes still aren't configured for GPUs. Am I missing some step that needs to be done after editing slurm.conf and creating gres.conf?

Krishna Teja

2014-08-08 18:34:39 UTC

I did restart slurmctld and slurmd on the master node and slurmd on the
compute nodes. When I run scontrol show nodes, the nodes do have a Gres
entry, but at the end I get "Reason=gres/gpu count too low". slurmd.log
doesn't show any errors (without turning on debugging).
Any ideas on how to fix that?

Krishna

Michael Robbert

2014-08-08 23:27:50 UTC

Does your slurmd detect the devices? I see these messages in the slurmd.log on my GPU nodes:

[2014-08-08T17:21:37.273] gpu 0 is device number 0
[2014-08-08T17:21:37.273] gpu 1 is device number 1

I am also assuming that the nodes are fully configured to use the GPUs with /dev/nvidia* device files.

[***@gpu001 slurm]# ls -l /dev/nvidia*
crw-rw-rw- 1 root root 195, 0 Aug 4 17:33 /dev/nvidia0
crw-rw-rw- 1 root root 195, 1 Aug 4 17:33 /dev/nvidia1
crw-rw-rw- 1 root root 195, 255 Aug 4 17:33 /dev/nvidiactl
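
The device files above are what gres.conf on each GPU node is expected to point at, using Name=gpu File=... lines. A minimal sketch that prints one such line per numbered device file follows. The function name gres_lines_for is made up, and the demo runs against a scratch directory that mimics the listing rather than a real /dev.

```shell
#!/bin/sh
# Print one gres.conf line per numbered NVIDIA device file under a directory.
# The control device (nvidiactl) does not match the pattern and is skipped.
gres_lines_for() {
    for dev in "$1"/nvidia[0-9]*; do
        [ -e "$dev" ] || continue   # no GPUs: the glob stays unexpanded
        echo "Name=gpu File=/dev/${dev##*/}"
    done
}

# Demo against a scratch directory that mimics the listing above.
demo=$(mktemp -d)
touch "$demo/nvidia0" "$demo/nvidia1" "$demo/nvidiactl"
gres_lines_for "$demo"

# On a GPU node: gres_lines_for /dev
```

The number of lines printed should match the gpu count given for that node in slurm.conf.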

Mike

Post by Krishna Teja
I did restart slurmctld and slurmd on the master node and slurmd on the compute nodes. When I run scontrol show nodes, the nodes do have a Gres entry, but at the end I get "Reason=gres/gpu count too low". slurmd.log doesn't show any errors (without turning on debugging).
Any ideas on how to fix that?
Krishna
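
As far as I understand it, "gres/gpu count too low" means that slurmd on the node registered fewer GPUs than the Gres= count on that node's line in slurm.conf. That would point at gres.conf on the compute node being missing or listing fewer devices than expected. A rough sketch that compares the two counts is below. It rests on several assumptions: the file contents are hypothetical, the node is listed under its own name rather than inside a bracket range, and gres.conf has one device per line.

```shell
#!/bin/sh
# Hypothetical configs for one node; replace with the real files.
dir=$(mktemp -d)
echo 'NodeName=compute-0-4 Gres=gpu:2' > "$dir/slurm.conf"
echo 'Name=gpu File=/dev/nvidia0'      > "$dir/gres.conf"

# GPU count that slurm.conf promises for the given node.
expected_gpus() {
    grep "^NodeName=$2 " "$1/slurm.conf" |
        sed -n 's/.*Gres=gpu:\([0-9][0-9]*\).*/\1/p'
}

# GPU count that gres.conf provides (assumes one device per line).
provided_gpus() {
    grep -c '^Name=gpu' "$1/gres.conf"
}

want=$(expected_gpus "$dir" compute-0-4)
have=$(provided_gpus "$dir")
if [ "$have" -lt "$want" ]; then
    echo "gres.conf lists $have gpu(s), slurm.conf expects $want"
fi
```

If the counts disagree, making gres.conf on the node list every device and then restarting slurmd there is the first thing I would try.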