Brad Reisfeld
2011-11-15 14:44:43 UTC
Hi,
I am trying to use slurm on a small cluster (master node + 5 compute
nodes). I am just getting started with slurm, so please forgive me
for bringing up what are likely very basic issues and problems. I
couldn't find relevant solutions by looking in the mailing list
archive or by googling.
platform: Linux CentOS v5
slurm: installed from rpms based on slurm-2.3.1.tar.bz2.
I installed munge-0.5.10 and it appears to be working on the master
and all of the compute nodes.
I have the IP addresses of the master node ('master') and compute
nodes ('cn1',...,'cn5') in /etc/hosts. The main machine ('bioshock')
has two network interfaces, and from it I can successfully ping the
master node and all of the compute nodes.
I have the line 'ControlMachine=master' in my slurm.conf file.
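Since 'scontrol ping' further down reports both a primary and a backup controller, it may be worth listing every controller-related line in slurm.conf, not only ControlMachine. The following is a minimal sketch: the helper name show_controllers is hypothetical, and the config path is passed as an argument because its location depends on how the rpms were built.

```shell
# Hypothetical helper: print the controller-related settings of a slurm.conf.
# $1 is the path to the config file (its location varies between installs).
show_controllers() {
    # Match uncommented ControlMachine/ControlAddr/BackupController/BackupAddr
    # lines, ignoring case and leading whitespace.
    grep -iE '^[[:space:]]*(ControlMachine|ControlAddr|BackupController|BackupAddr)=' "$1"
}
```

It could be called as, for example, 'show_controllers /etc/slurm/slurm.conf' (that path is an assumption); a BackupController line in the output would account for the backup entry reported by 'scontrol ping'.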
When starting slurm through slurmctld, I experience a couple of
issues as shown below my signature.
In these messages, I don't know what to make of
'Invalid RPC received 2030 while in standby mode'
and I don't understand why I get
'Neither primary nor backup controller responding, sleep and retry'
when I can successfully ping the primary controller (which I assume
is the same as ControlMachine).
Strangely, after I execute
$ /etc/init.d/slurm start
the system seems to show that both the primary and backup controllers
are up:
$ scontrol ping
Slurmctld(primary/backup) at master/bioshock are UP/UP
At this stage, if I execute 'scontrol show config', the command just
hangs and produces no output, even after several minutes. The command
'sinfo' also hangs.
If I then execute 'slurmctld' again, I get the same error messages
as shown below.
I'd appreciate any help or insights you can provide to help me
address these issues.
Thank you.
Kind regards,
Brad
==========
$ slurmctld -Dvvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we are giving a blank list
slurmctld: debug2: No Assoc usage file (/tmp/assoc_usage) to recover
slurmctld: slurmctld version 2.3.1 started on cluster cluster
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/select_cons_res.so
slurmctld: Consumable Resources (CR) Node Selection plugin loaded with argument 4
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/jobacct_gather_none.so
slurmctld: Job accounting gather NOT_INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: slurmctld running in background mode
slurmctld: debug3: _background_rpc_mgr pid = 32571
slurmctld: debug3: Trying to load plugin /usr/lib64/slurm/auth_munge.so
slurmctld: auth plugin for Munge (http://home.gna.org/munge/) loaded
slurmctld: debug3: Success.
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
slurmctld: error: Invalid RPC received 2030 while in standby mode
slurmctld: debug: Neither primary nor backup controller responding, sleep and retry
...