Discussion:
newbie issue with new slurm install
Brad Reisfeld
2011-11-15 14:44:43 UTC
Permalink
Hi,

I am trying to use slurm on a small cluster (master node + 5 compute
nodes). I am just getting started with slurm, so please forgive me
for bringing up what are likely very basic issues and problems. I
couldn't find relevant solutions by looking in the mailing list
archive or by googling.

Platform: Linux CentOS 5
SLURM: installed from RPMs built from slurm-2.3.1.tar.bz2.
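For reference, the RPMs were presumably built in the usual way, roughly
as follows (the exact invocation and output path may have differed):

$ rpmbuild -ta slurm-2.3.1.tar.bz2     # build the SLURM RPMs from the tarball
$ rpm -ivh .../slurm-*.rpm             # install the resulting packages
                                       # (path depends on the rpmbuild setup)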

I installed munge-0.5.10 and it appears to be working on the master
and all of the compute nodes.

I have the IP addresses of the master node ('master') and compute
nodes ('cn1',...,'cn5') in /etc/hosts. The main machine ('bioshock')
has two network interfaces, and I can successfully ping the master
node and all of the compute nodes from it.

I have the line 'ControlMachine=master' in my slurm.conf file.

When starting SLURM via slurmctld, I run into a couple of issues,
shown below my signature.

In these messages, I don't know what to make of
'Invalid RPC received 2030 while in standby mode'
and I don't understand why I get
'Neither primary nor backup controller responding, sleep and retry'
when I can successfully ping the primary controller (which I assume
is the same as ControlMachine).

Strangely, after I execute

$ /etc/init.d/slurm start

the system seems to show that the primary and backup controllers are up:

$ scontrol ping
Slurmctld(primary/backup) at master/bioshock are UP/UP

At this stage, if I execute 'scontrol show config', the command just
hangs and produces no output after several minutes. The command
'sinfo' also hangs.

If I then execute 'slurmctld' again, I get the same error messages
as shown below.


I'd appreciate any help or insights you can provide to help me
address these issues.

Thank you.

Kind regards,
Brad

==========

$ slurmctld -Dvvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
slurmctld: slurmctld version 2.3.1 started on cluster cluster
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: slurmctld running in background mode
slurmctld: debug3: _background_rpc_mgr pid = 32571
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmctld: debug3: Success.
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
...
Andy Riebs
2011-11-15 15:08:04 UTC
Permalink
Hi Brad,

Just to get a couple of easy questions out of the way:

0. When you say that "bioshock" is "the main machine," does that mean
it's the master node?
1. Have the /etc/hosts definitions been propagated across the cluster?
2. Can you ping master from each of the clients, and vice versa?
3. Does "munge -n | unmunge" generate a successful result on each of the
nodes?
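
In case it helps, checks 1-3 can be run from the master in one pass;
this is just a sketch using the host names from your note:

$ for h in cn1 cn2 cn3 cn4 cn5; do
      ping -c1 $h                    # master -> node reachability
      ssh $h getent hosts master     # /etc/hosts entry present on the node?
      ssh $h ping -c1 master         # node -> master reachability
      munge -n | ssh $h unmunge      # munge credential round-trip
  done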

Andy
Post by Brad Reisfeld
...
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP
Brad Reisfeld
2011-11-15 16:59:29 UTC
Permalink
Post by Andy Riebs
0. When you say that "bioshock" is "the main machine," does that
mean it's the master node?
The main server has a 'public' network interface with an IP address
(129.82.X.X) that resolves to 'bioshock'. It also has a 'private'
network interface with an IP address (10.0.0.1) that resolves to
'master'.
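For reference, the relevant /etc/hosts entries look roughly like this
(the compute-node addresses below are illustrative; only the bioshock
and master lines reflect the actual interfaces):

129.82.X.X   bioshock    # public interface
10.0.0.1     master      # cluster-private interface
10.0.0.2     cn1
...
10.0.0.6     cn5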
Post by Andy Riebs
1. Have the /etc/hosts definitions been propagated across the cluster?
Yes.
Post by Andy Riebs
2. Can you ping master from each of the clients, and vice versa?
Yes.
Post by Andy Riebs
3. Does "munge -n | unmunge" generate a successful result on each of
the nodes?
Yes, for instance...
$ munge -n | ssh cn1 unmunge
STATUS: Success (0)
...
Post by Andy Riebs
Andy
Thank you for your help.
Kind regards,
Brad
Oh well, so much for the easy ones.
I'd suggest re-sending your base note, with your responses to my
questions, and a copy of your slurm.conf to the mailing list to get
some more eyes on it.
(Your unstated assumption is correct -- what you are seeing is
rather odd.)
Cheers
Andy
Thanks, Andy.

If anyone has any ideas, please let me know.

Below are the contents of my slurm.conf file.

Kind regards,
Brad

==

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=master
#ControlAddr=
BackupController=bioshock
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#PrologSlurmctld=
#FirstJobId=1
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=cn[1-5] Sockets=2 CoresPerSocket=4 State=UNKNOWN
PartitionName=default Nodes=cn[1-5] Default=YES MaxTime=INFINITE State=UP
Andy Riebs
2011-11-15 17:31:04 UTC
Permalink
Brad, try disabling (commenting out) the BackupController definition.
It's not inconceivable that SLURM is getting confused by trying to run 2
copies of the daemon on the same node.
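
That is, in slurm.conf, simply:

ControlMachine=master
#BackupController=bioshock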

Andy
Post by Brad Reisfeld
...
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP
Brad Reisfeld
2011-11-15 20:42:09 UTC
Permalink
Post by Andy Riebs
Brad, try disabling (commenting out) the BackupController
definition. It's not inconceivable that SLURM is getting confused by
trying to run 2 copies of the daemon on the same node.
Andy
Hi Andy,

I appreciate the suggestion.

I tried that change and get the following:

$ slurmctld -Dvvv
...
slurmctld: error: this host (bioshock) not valid controller (master or (null))

So it appears this is disallowed because neither ControlMachine nor
BackupController is set to the machine's hostname. How is this
normally done for a master node that has both a public and a
cluster-private network interface?

Thank you.

Kind regards,
Brad
Andy Riebs
2011-11-15 21:04:59 UTC
Permalink
Hi Brad,

I believe that the problem here is that slurmctld is doing the
equivalent of `hostname -s`, which returns "bioshock", thus telling
slurmctld that it doesn't belong here.

The easiest way to resolve the problem would be to use "bioshock" for
SLURM's ControlMachine argument; remember that all IP traffic will
actually be routed by IP address, rather than by network or host name,
so this shouldn't confuse anything.

It may be possible, instead, to set ControlAddr=master and
ControlMachine=bioshock, but my test bed is currently down, so I can't
check this out.
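
In slurm.conf terms, the two variants would look roughly like this (the
second is the untested combination):

# variant 1: make ControlMachine match what `hostname -s` returns
ControlMachine=bioshock

# variant 2 (untested): keep the name, point the address at the private interface
ControlMachine=bioshock
ControlAddr=master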

Or am I missing some facet of this?

Andy
Post by Brad Reisfeld
...
--
Andy Riebs
Hewlett-Packard Company
High Performance Computing
+1-786-263-9743
My opinions are not necessarily those of HP