Discussion:
slurm salloc
Eva Hocks
2014-09-11 22:44:30 UTC
Permalink
I am trying to configure the latest Slurm 14.03 and am running into a
problem preventing Slurm from running jobs on the control node.

sinfo shows 3 nodes configured in the slurm.conf:
active up 2:00:00 1 down* hpc-0-5
active up 2:00:00 1 mix hpc-0-4
active up 2:00:00 1 idle hpc-0-6


but when I use salloc, I end up on the head node:


$ salloc -N 1 -p active sh
salloc: Granted job allocation 16
sh-4.1$ hostname
hpcdev-005.sdsc.edu


That node is not part of the "active" partition, but Slurm still uses it.
How? The allocation, by the way, is for NodeList=hpc-0-4,
and the user can log in to that node without a problem, but Slurm doesn't
run the sh on that node for the user.

Also, how can a user find out which nodes are allocated without having to
run the scontrol command? Is there an option in salloc to return the
host names?

Thanks
Eva
Uwe Sauter
2014-09-12 07:05:31 UTC
Permalink
Hi Eva,

if you don't want to use the controller node for jobs, the easiest way
is not to configure it as a node at all. That means you don't need a line like

NodeName=hpc-0-5 RealMemory=....

for the controller.
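For example, the node section could be reduced to something like this
(CPUs and RealMemory here are just placeholders, adjust them to your hardware):

NodeName=hpc-0-[4-6] CPUs=8 RealMemory=16000 State=UNKNOWN

so the controller (hpcdev-005) only appears in ControlMachine= and never in
a NodeName= line.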


A program/user can find out which nodes are allocated by looking at
the environment variables. Try running salloc and then

$ env | grep SLURM

Here is an example output:

SLURM_NODELIST=n523601
SLURM_NODE_ALIASES=(null)
SLURM_NNODES=1
SLURM_JOBID=6437
SLURM_TASKS_PER_NODE=40
SLURM_JOB_ID=6437
SLURM_SUBMIT_DIR=/nfs/admins/adm17
SLURM_JOB_NODELIST=n523601
SLURM_JOB_CPUS_PER_NODE=40
SLURM_SUBMIT_HOST=frontend
SLURM_JOB_PARTITION=foo
SLURM_JOB_NUM_NODES=1
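
If the node list comes back in a compressed form like hpc-0-[4-6], it can
be expanded into individual host names with

$ scontrol show hostnames $SLURM_JOB_NODELIST

(that still uses scontrol, but only to expand the list, not to query the job).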



Regards,

Uwe
Sergio Iserte
2014-09-12 07:19:38 UTC
Permalink
Hello Eva,
you must remove the management nodes from the "Nodes" field of the
"PartitionName" parameter.

With your slurm.conf file it would be easier to write an example, but this
should work anyway!
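
As a rough sketch based on the node names in your sinfo output, the partition
line should look something like

PartitionName=active Nodes=hpc-0-[4-6] MaxTime=2:00:00 State=UP

with only compute nodes in Nodes= and hpcdev-005 nowhere in it.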

Regards,
Sergio.
--
Sergio Iserte Agut, research assistant,
High Performance Computing & Architecture
Jaume I University (Castellón, Spain)
Chrysovalantis Paschoulas
2014-09-12 08:17:41 UTC
Permalink
Hi Eva!

As Sergio said, you have to specify the compute nodes with "NodeName=..." and then define partitions that include those compute nodes with "PartitionName=... Nodes=..." without including the head nodes or the login nodes. You could also set the "AllocNodes=..." parameter in the slurm.conf file, where we usually list only the login nodes, in order to disable submission from any nodes other than the login nodes.
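
For example (assuming hpcdev-005 is your login node; adjust the names to your setup):

PartitionName=active Nodes=hpc-0-[4-6] AllocNodes=hpcdev-005 MaxTime=2:00:00 State=UP

With AllocNodes set, allocations in that partition can only be requested from the listed node(s).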

So my question now is: is node hpcdev-005.sdsc.edu a login node or a master/admin node? I mean, did you do the submission from that node or from a different one? Because if it is the login node, then there is no error at all; this is the default behaviour of salloc.

By default, salloc (you can change this) gives you a shell on the node where the submission took place, and then with srun commands you can execute programs on the compute nodes.

So, in case you want an interactive shell on the compute nodes, you should execute:

"salloc -N1 -p active srun -N1 --pty sh"

or directly an srun command (without salloc involved):

"srun -N1 -p active --pty sh"

Best Regards,
Chrysovalantis Paschoulas



Eva Hocks
2014-09-12 18:21:43 UTC
Permalink
Thanks much for all your replies!

As for the configuration, the head node is never included in the
PartitionName= Nodes= list of any of the partitions. There is a default
NodeName entry in the Slurm configuration set by the install itself. I
basically used the same configuration as I used with Slurm 2.6.0. That
version never ran jobs on the head node, and it also did not start slurmd
on the head node. I removed the NodeName entry for the head node and that
took care of the slurmd startup.

Thanks for the salloc clarification. I am still learning the nuances of
Slurm, coming from other schedulers (SGE/LoadLeveler/Torque).

Thanks
Eva
Bruce Roberts
2014-09-12 18:21:37 UTC
Permalink
I think the problem here is that the salloc you ran doesn't
automatically send you to a node in your allocation. I am guessing you
ran your salloc from hpcdev-005; that is why hostname by itself returns
that. If you ran srun hostname inside your salloc, it would run on
the node in your allocation.
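
For example, something like this (the job ID and node name are only illustrative):

$ salloc -N 1 -p active
salloc: Granted job allocation 17
$ hostname          # still runs on the submit host
hpcdev-005.sdsc.edu
$ srun hostname     # runs on the allocated node
hpc-0-4
$ exit
salloc: Relinquishing job allocation 17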
Eva Hocks
2014-09-12 18:35:30 UTC
Permalink
Yes, that's exactly the problem. I assumed salloc was used to run an
interactive job, but I was wrong. I guess I have to use srun for that.

I am still trying to figure out how to write the user guide to explain when
to use salloc followed by srun instead of srun in the first place.

Thanks a bunch for your help
Eva