Discussion: cluster nodes down
Erica Riello
2014-09-08 12:53:35 UTC
Hello all,

I have 2 machines running Slurm 14.03.07, called torquepbs and torquepbsno1.
Slurmctld is running on torquepbs, and slurmd is running on both
torquepbs and torquepbsno1.
They both have the same munge key and the same configuration file
(slurm.conf), but torquepbs is up and torquepbsno1 is down.
Munge daemon is running on both machines.

What can be wrong?

slurm.conf:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=torquepbs
ControlAddr=localhost
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmcltd
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd
#
#
# COMPUTE NODES
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP

Thanks in advance.
--
===============
Erica Riello
Computer Engineering student, PUC-Rio
Chrysovalantis Paschoulas
2014-09-08 14:36:41 UTC
Hi!

First, I would suggest changing the "ControlAddr" parameter and setting it to the controller's local-network IP so that torquepbsno1 can reach the server. (In our case, lsof shows that slurmctld listens on all interfaces, "TCP *:6817 (LISTEN)", so this option may not be what blocks the connection, but the clients may still use that address to contact the controller.)
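For example, something along these lines (the IP below is only a placeholder; use torquepbs's real LAN address, or simply remove ControlAddr so the hostname is resolved instead):

ControlMachine=torquepbs
ControlAddr=192.168.0.10   # placeholder, replace with the LAN IP of torquepbs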

After that, you should also check the firewall rules and perhaps the routing configuration.
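A quick sanity check, assuming nc and lsof are available, would be something like:

# on torquepbsno1: check that the controller's slurmctld port is reachable
nc -zv torquepbs 6817

# on torquepbs: confirm slurmctld is listening
lsof -i :6817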

Other common reasons for clients not being able to connect are a wrong Munge
configuration (you said it is OK) or a version mismatch between the Slurm daemons.
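Both are easy to verify; assuming ssh access between the machines, something like this should do:

# cross-node munge test (run on torquepbs; unmunge on torquepbsno1 should decode it cleanly)
munge -n | ssh torquepbsno1 unmunge

# compare daemon versions on the two machines
slurmctld -V
slurmd -V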

Of course, some logs would be really helpful for understanding the reason for your problem ;)
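For example, you could raise SlurmctldDebug/SlurmdDebug in slurm.conf, restart the daemons, and then check the node's reason and the log files you already configured:

scontrol show node torquepbsno1      # the Reason= field often says why a node is down
tail /var/log/slurm-llnl/slurmd      # slurmd log path from your slurm.conf (on torquepbsno1)
tail /var/log/slurm-llnl/slurmcltd   # slurmctld log path as written in your slurm.conf (on torquepbs)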

Best Regards,
Chrysovalantis Paschoulas

Forschungszentrum Juelich - Juelich Supercomputing Centre


Erica Riello
2014-09-08 17:15:44 UTC
Thanks for the answer, Chrysovalantis!

It was actually a misconfiguration in munge that was causing the problem.

Thank you again.
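For anyone who hits the same issue: a quick way to confirm that munge works across the nodes, and to bring the node back into service afterwards, is roughly:

munge -n | ssh torquepbsno1 unmunge                   # should decode cleanly on the remote side
scontrol update NodeName=torquepbsno1 State=RESUME    # return the node to service

With ReturnToService=1 set, the node may also come back on its own once slurmd re-registers.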

--
===============
Erica Riello
Computer Engineering student, PUC-Rio