slurm connection refused
Erica Riello
2014-08-26 18:22:57 UTC
Hi all,

I'd like to know what kinds of errors can lead to a 'Connection refused'
message in Slurm.

I've installed Slurm 14.03.6 on a 64-bit Ubuntu VM (VirtualBox). When I run
slurmctld, I get the message "problems with erica-VirtualBox", but nothing
in the output indicates what is wrong.

The configuration file is:

***@erica-VirtualBox:/usr/local/etc# more slurm.conf
# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=erica-VirtualBox
ControlAddr=localhost
#
MailProg=/usr/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/tmp/slurm
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
#SchedulerPort=7321
SelectType=select/linear
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/linux
#SlurmctldDebug=3
SlurmctldLogFile=/var/log/slurm/slurmctld
#SlurmdDebug=3
SlurmdLogFile=/var/log/slurm/slurmd
#
#
# COMPUTE NODES
NodeName=erica-VirtualBox CPUs=1 RealMemory=2002 Sockets=1 CoresPerSocket=1 ThreadsPerCore=1 State=UNKNOWN
PartitionName=particao1 Nodes=erica-VirtualBox Default=YES MaxTime=INFINITE State=UP
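
The configuration above uses two different names for the same machine
(ControlAddr=localhost and NodeName=erica-VirtualBox), and they need not
resolve to the same address. A minimal Python sketch for checking what each
name resolves to on this host follows. The helper name resolve_ipv4 is
hypothetical and not part of Slurm; it only calls the standard library
resolver.

```python
import socket
from typing import Optional


def resolve_ipv4(hostname: str) -> Optional[str]:
    """Return the IPv4 address the system resolver gives for hostname.

    Returns None if the name cannot be resolved.
    """
    try:
        return socket.gethostbyname(hostname)
    except OSError:
        # socket.gaierror (a subclass of OSError) is raised for unknown names.
        return None


if __name__ == "__main__":
    # Names taken from the slurm.conf above.
    for name in ("localhost", "erica-VirtualBox"):
        print(f"{name} -> {resolve_ipv4(name)}")
```

On Ubuntu the machine's own hostname is commonly mapped to 127.0.1.1 in
/etc/hosts, which would match the address that appears in the log below.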

The log of slurmctld shows:

***@erica-VirtualBox:/usr/local/etc# slurmctld -D -vvvv
slurmctld: pidfile not locked, assuming no running daemon
slurmctld: debug3: Version in last_conf_lite header is 6912
slurmctld: error: Job accounting information gathered, but not stored
slurmctld: slurmctld version 14.03.6 started on cluster cluster
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/crypto_munge.so
slurmctld: Munge cryptographic signature plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/select_linear.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/preempt_none.so
slurmctld: preempt/none loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/checkpoint_none.so
slurmctld: debug3: Success.
slurmctld: Checkpoint plugin loaded: checkpoint/none
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/acct_gather_energy_none.so
slurmctld: AcctGatherEnergy NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/acct_gather_profile_none.so
slurmctld: AcctGatherProfile NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/acct_gather_infiniband_none.so
slurmctld: AcctGatherInfiniband NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/acct_gather_filesystem_none.so
slurmctld: AcctGatherFilesystem NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug2: No acct_gather.conf file
(/usr/local/etc/acct_gather.conf)
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/jobacct_gather_linux.so
slurmctld: Job accounting gather LINUX plugin loaded
slurmctld: debug3: Success.
slurmctld: WARNING: We will use a much slower algorithm with
proctrack/pgid, use Proctracktype=proctrack/linuxproc or some other
proctrack when using jobacct_gather/linux
slurmctld: error: WARNING: Even though we are collecting accounting
information you have asked for it not to be stored
(accounting_storage/none) if this is not what you have in mind you will
need to change it.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/ext_sensors_none.so
slurmctld: ExtSensors NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin /usr/local/lib/slurm/switch_none.so
slurmctld: switch NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No backup controller to shutdown
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/accounting_storage_none.so
slurmctld: Accounting storage NOT INVOKED plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: not enforcing associations and no list was given so we
are giving a blank list
slurmctld: debug3: Version in assoc_mgr_state header is 1
slurmctld: debug: Reading slurm.conf file: /usr/local/etc/slurm.conf
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/topology_none.so
slurmctld: topology NONE plugin loaded
slurmctld: debug3: Success.
slurmctld: debug: No DownNodes
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/jobcomp_none.so
slurmctld: debug3: Success.
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/sched_backfill.so
slurmctld: sched: Backfill scheduler plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: Version string in node_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 1 nodes
slurmctld: debug3: Version string in job_state header is PROTOCOL_VERSION
slurmctld: debug3: Job id in job_state header is 1
slurmctld: debug3: Set job_id_sequence to 1
slurmctld: Recovered information about 0 jobs
slurmctld: debug: Updating partition uid access list
slurmctld: debug3: Version string in resv_state header is PROTOCOL_VERSION
slurmctld: Recovered state of 0 reservations
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Running as primary controller
slurmctld: debug3: Trying to load plugin
/usr/local/lib/slurm/priority_basic.so
slurmctld: debug: Priority BASIC plugin loaded
slurmctld: debug3: Success.
slurmctld: debug3: _slurmctld_background pid = 5995
slurmctld: debug: power_save module disabled, SuspendTime < 0
slurmctld: debug3: _slurmctld_rpc_mgr pid = 5995
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug: Spawning registration agent for erica-VirtualBox 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type
REQUEST_NODE_REGISTRATION_STATUS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to erica-VirtualBox
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 127.0.1.1:6818:
Connection refused
slurmctld: debug3: problems with erica-VirtualBox
slurmctld: debug2: Tree head got back 1
slurmctld: debug2: Tree head got back 1
slurmctld: agent/is_node_resp: node:erica-VirtualBox rpc:1001 :
Communication connection failure
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug: sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: debug3: Writing job id 1 to header record of job_state file
slurmctld: debug3: _slurmctld_background shutting down
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission
denied
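
The log shows slurmctld listening on port 6817 and being refused on
127.0.1.1:6818, which is the default SlurmdPort in the configuration above.
A refused TCP connection generally means nothing is accepting connections
on that address and port, so one check is whether slurmd is running and
listening there. A minimal Python sketch of such a check follows. The helper
name port_is_open is hypothetical and not part of Slurm; the addresses and
ports are the ones from the log and slurm.conf.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # ConnectionRefusedError (nothing listening), timeouts, and
        # unreachable hosts are all subclasses of OSError.
        return False


if __name__ == "__main__":
    # Addresses and ports taken from the log output and slurm.conf above.
    checks = [
        ("slurmctld", "127.0.0.1", 6817),
        ("slurmd", "127.0.1.1", 6818),
    ]
    for daemon, host, port in checks:
        state = "accepting connections" if port_is_open(host, port) else "not reachable"
        print(f"{daemon} at {host}:{port}: {state}")
```

If the slurmd line reports "not reachable" while slurmctld is running, that
is consistent with the repeated "Connection refused" lines in the log.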

Regards,
--
===============
Erica Riello