Erica Riello
2014-09-08 12:53:35 UTC
Hello all,
I have 2 machines running Slurm 14.03.07, called torquepbs and torquepbsno1.
slurmctld is running on torquepbs, and there is a slurmd running on both
torquepbs and torquepbsno1.
Both machines have the same munge key and the same configuration file
(slurm.conf), but torquepbs is reported as up and torquepbsno1 as down.
The munge daemon is running on both machines.
What could be wrong?
slurm.conf:
# slurm.conf file generated by configurator.easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=torquepbs
ControlAddr=localhost
#
MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurmctld
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm-llnl/slurmcltd
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm-llnl/slurmd
#
#
# COMPUTE NODES
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=torquepbs,torquepbsno1,torquepbsno2 Default=YES MaxTime=INFINITE State=UP
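As a sanity check on the node definitions above, here is a minimal Python sketch (not part of Slurm; the helper names `parse_node_lines` and `cpu_count_consistent` are made up for illustration). It assumes each NodeName entry sits on a single line and checks the usual expectation that CPUs equals Sockets x CoresPerSocket x ThreadsPerCore. It only checks the arithmetic and says nothing about why a node is marked down.

```python
def parse_node_lines(conf_text):
    """Return {node_name: {key: value}} for every NodeName line in the text."""
    nodes = {}
    for line in conf_text.splitlines():
        # Drop comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line.startswith("NodeName="):
            continue
        # Each token on the line has the form Key=Value.
        fields = dict(tok.split("=", 1) for tok in line.split() if "=" in tok)
        nodes[fields.pop("NodeName")] = fields
    return nodes


def cpu_count_consistent(fields):
    """True if CPUs == Sockets * CoresPerSocket * ThreadsPerCore."""
    expected = (
        int(fields["Sockets"])
        * int(fields["CoresPerSocket"])
        * int(fields["ThreadsPerCore"])
    )
    return int(fields["CPUs"]) == expected


# Node lines copied from the slurm.conf in this post.
SAMPLE = """\
NodeName=torquepbsno1 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbsno2 CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
NodeName=torquepbs CPUs=4 RealMemory=2001 Sockets=2 CoresPerSocket=2 ThreadsPerCore=1 State=UNKNOWN
"""

nodes = parse_node_lines(SAMPLE)
for name, fields in nodes.items():
    status = "consistent" if cpu_count_consistent(fields) else "MISMATCH"
    print(name, status)
```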
Thanks in advance.
--
===============
Erica Riello
Computer Engineering student, PUC-Rio
===============