Did you check your firewall rules? As you can see, you need to allow
connections on port 6818...
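For reference, on a typical Linux node the Slurm ports can be opened with something like the following. These commands are illustrative, not specific to Monica's setup; adjust for whichever firewall tooling your distribution uses:

```shell
# Open the slurmctld (6817) and slurmd (6818) ports.

# firewalld (RHEL/CentOS/Fedora):
sudo firewall-cmd --permanent --add-port=6817-6818/tcp
sudo firewall-cmd --reload

# ufw (Ubuntu/Debian):
sudo ufw allow 6817:6818/tcp

# Verify that slurmd is actually listening on 6818:
ss -tlnp | grep 6818
```

If `ss` shows nothing listening on 6818, the "Connection refused" errors are from slurmd not running rather than from the firewall.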
Post by Monica Marathe
Hey Michael,
# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=control-machine
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=control-machine CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=control-machine Default=YES MaxTime=INFINITE State=UP
slurmctld: error: Configured MailProg is invalid
slurmctld: error: ################################################
slurmctld: error: ### SEVERE SECURITY VULERABILTY ###
slurmctld: error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
slurmctld: error: ### CORRECT FILE PERMISSIONS ###
slurmctld: error: ################################################
slurmctld: error: Could not open node state file /tmp/node_state: No
such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may
be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such
file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: debug: Updating partition uid access list
slurmctld: error: Could not open reservation state file /tmp/resv_state:
No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may
be lost
slurmctld: No reservation state file (/tmp/resv_state.old) to recover
slurmctld: Recovered state of 0 reservations
slurmctld: error: Could not open trigger state file /tmp/trigger_state:
No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
slurmctld: error: Incomplete trigger data checkpoint file
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Reinitializing job accounting state
slurmctld: Running as primary controller
slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug: Spawning registration agent for control-machine 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type
REQUEST_NODE_REGISTRATION_STATUS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to control-machine
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
Communication connection failure
slurmctld: error: Nodes control-machine not responding
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug: sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: error: Could not open job state file /tmp/job_state: No such
file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) found
slurmctld: debug3: Writing job id 1 to header record of job_state file
slurmctld: debug4: unable to create link for /tmp/job_state ->
/tmp/job_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/node_state ->
/tmp/node_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/part_state ->
/tmp/part_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/resv_state ->
/tmp/resv_state.old: No such file or directory
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at
10.47.65.195:6818: Connection refused
slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
slurmctld: debug4: unable to create link for /tmp/trigger_state ->
/tmp/trigger_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_mgr_state ->
/tmp/assoc_mgr_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_usage ->
/tmp/assoc_usage.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/qos_usage ->
/tmp/qos_usage.old: No such file or directory
slurmctld: debug3: _slurmctld_background shutting down
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission
denied
I am trying to run this on a single machine.
Any suggestions?
Thanks!
-Monica
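The slurmctld log above points to three concrete problems: `StateSaveLocation=/tmp` is world-writable (hence the security warning and the lost state files), nothing is answering on port 6818 (slurmd is not running), and slurmctld cannot write its pidfile under /var/run. The "Configured MailProg is invalid" line just means /bin/mail is not installed and is harmless for testing. A sketch of the fixes, with paths and service commands as assumptions to adapt to your installation:

```shell
# 1. Give slurmctld a dedicated, non-world-writable state directory
#    (then set StateSaveLocation=/var/spool/slurmctld in slurm.conf):
sudo mkdir -p /var/spool/slurmctld
sudo chown slurm:slurm /var/spool/slurmctld
sudo chmod 0755 /var/spool/slurmctld

# 2. Start slurmd on the (single) node so port 6818 has a listener:
sudo slurmd
# or, depending on the distro packaging:
#   sudo systemctl start slurmd

# 3. For the pidfile "Permission denied": either run slurmctld with
#    enough privilege to write /var/run, or point SlurmctldPidFile at
#    a directory the slurm user owns.
```

With those in place, restarting slurmctld should register the node instead of looping on "Connection refused".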
Have you tried generating your own on the web?
http://slurm.schedmd.com/configurator.easy.html
http://slurm.schedmd.com/configurator.html
You might get more appropriate results that way. :-)
Michael
On Thu, Oct 9, 2014 at 11:11 AM, Monica Marathe wrote:
Hi,
Can anyone send me a sample slurm.conf file? I am trying to
configure SLURM on a single machine only.
Thanks!
-Monica
--
- Monica Marathe
--
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615