Discussion: Sample slurm.conf
Monica Marathe
2014-10-09 18:11:31 UTC
Hi,

Can anyone send me a sample slurm.conf file? I am trying to configure SLURM
on a single machine only.

Thanks!
-Monica
--
- Monica Marathe
Michael Jennings
2014-10-09 18:17:00 UTC
Have you tried generating your own on the web?

http://slurm.schedmd.com/configurator.easy.html
http://slurm.schedmd.com/configurator.html

You might get more appropriate results that way. :-)

Michael
Post by Monica Marathe
Hi,
Can anyone send me a sample slurm.conf file? I am trying to configure
SLURM on a single machine only.
Thanks!
-Monica
--
- Monica Marathe
--
Michael Jennings <mej-/***@public.gmane.org>
Senior HPC Systems Engineer
High-Performance Computing Services
Lawrence Berkeley National Laboratory
Bldg 50B-3209E W: 510-495-2687
MS 050B-3209 F: 510-486-8615
Monica Marathe
2014-10-09 19:02:59 UTC
Hey Michael,
I did build my configuration file:

# slurm.conf file generated by configurator.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=control-machine
#ControlAddr=
#BackupController=
#BackupAddr=
#
AuthType=auth/munge
CacheGroups=0
#CheckpointType=checkpoint/none
CryptoType=crypto/munge
#DisableRootJobs=NO
#EnforcePartLimits=NO
#Epilog=
#EpilogSlurmctld=
#FirstJobId=1
#MaxJobId=999999
#GresTypes=
#GroupUpdateForce=0
#GroupUpdateTime=600
#JobCheckpointDir=/var/slurm/checkpoint
#JobCredentialPrivateKey=
#JobCredentialPublicCertificate=
#JobFileAppend=0
#JobRequeue=1
#JobSubmitPlugins=1
#KillOnBadExit=0
#Licenses=foo*4,bar
#MailProg=/bin/mail
#MaxJobCount=5000
#MaxStepCount=40000
#MaxTasksPerNode=128
MpiDefault=none
#MpiParams=ports=#-#
#PluginDir=
#PlugStackConfig=
#PrivateData=jobs
ProctrackType=proctrack/pgid
#Prolog=
#PrologSlurmctld=
#PropagatePrioProcess=0
#PropagateResourceLimits=
#PropagateResourceLimitsExcept=
ReturnToService=1
#SallocDefaultCommand=
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/tmp/slurmd
SlurmUser=slurm
#SlurmdUser=root
#SrunEpilog=
#SrunProlog=
StateSaveLocation=/tmp
SwitchType=switch/none
#TaskEpilog=
TaskPlugin=task/none
#TaskPluginParam=
#TaskProlog=
#TopologyPlugin=topology/tree
#TmpFs=/tmp
#TrackWCKey=no
#TreeWidth=
#UnkillableStepProgram=
#UsePAM=0
#
#
# TIMERS
#BatchStartTimeout=10
#CompleteWait=0
#EpilogMsgTime=2000
#GetEnvTimeout=2
#HealthCheckInterval=0
#HealthCheckProgram=
InactiveLimit=0
KillWait=30
#MessageTimeout=10
#ResvOverRun=0
MinJobAge=300
#OverTimeLimit=0
SlurmctldTimeout=120
SlurmdTimeout=300
#UnkillableStepTimeout=60
#VSizeFactor=0
Waittime=0
#
#
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
#SchedulerRootFilter=1
#SchedulerTimeSlice=30
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/linear
#SelectTypeParameters=
#
#
# JOB PRIORITY
#PriorityType=priority/basic
#PriorityDecayHalfLife=
#PriorityCalcPeriod=
#PriorityFavorSmall=
#PriorityMaxAge=
#PriorityUsageResetPeriod=
#PriorityWeightAge=
#PriorityWeightFairshare=
#PriorityWeightJobSize=
#PriorityWeightPartition=
#PriorityWeightQOS=
#
#
# LOGGING AND ACCOUNTING
#AccountingStorageEnforce=0
#AccountingStorageHost=
#AccountingStorageLoc=
#AccountingStoragePass=
#AccountingStoragePort=
AccountingStorageType=accounting_storage/none
#AccountingStorageUser=
AccountingStoreJobComment=YES
ClusterName=cluster
#DebugFlags=
#JobCompHost=
#JobCompLoc=
#JobCompPass=
#JobCompPort=
JobCompType=jobcomp/none
#JobCompUser=
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
SlurmctldDebug=3
#SlurmctldLogFile=
SlurmdDebug=3
#SlurmdLogFile=
#SlurmSchedLogFile=
#SlurmSchedLogLevel=
#
#
# POWER SAVE SUPPORT FOR IDLE NODES (optional)
#SuspendProgram=
#ResumeProgram=
#SuspendTimeout=
#ResumeTimeout=
#ResumeRate=
#SuspendExcNodes=
#SuspendExcParts=
#SuspendRate=
#SuspendTime=
#
#
# COMPUTE NODES
NodeName=control-machine CPUs=1 State=UNKNOWN
PartitionName=debug Nodes=control-machine Default=YES MaxTime=INFINITE State=UP

But I get the following errors when I run slurmctld:

slurmctld: error: Configured MailProg is invalid

slurmctld: error: ################################################
slurmctld: error: ### SEVERE SECURITY VULERABILTY ###
slurmctld: error: ### StateSaveLocation DIRECTORY IS WORLD WRITABLE ###
slurmctld: error: ### CORRECT FILE PERMISSIONS ###
slurmctld: error: ################################################
slurmctld: error: Could not open node state file /tmp/node_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Information may be lost!
slurmctld: No node state file (/tmp/node_state.old) to recover
slurmctld: error: Incomplete node data checkpoint file
slurmctld: Recovered state of 0 nodes
slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) to recover
slurmctld: debug: Updating partition uid access list
slurmctld: error: Could not open reservation state file /tmp/resv_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Reservations may be lost
slurmctld: No reservation state file (/tmp/resv_state.old) to recover
slurmctld: Recovered state of 0 reservations
slurmctld: error: Could not open trigger state file /tmp/trigger_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Triggers may be lost!
slurmctld: No trigger state file (/tmp/trigger_state.old) to recover
slurmctld: error: Incomplete trigger data checkpoint file
slurmctld: State of 0 triggers recovered
slurmctld: read_slurm_conf: backup_controller not specified.
slurmctld: Reinitializing job accounting state
slurmctld: Running as primary controller

slurmctld: debug2: slurmctld listening on 0.0.0.0:6817
slurmctld: debug: Spawning registration agent for control-machine 1 hosts
slurmctld: debug2: Spawning RPC agent for msg_type REQUEST_NODE_REGISTRATION_STATUS
slurmctld: debug2: got 1 threads to send out
slurmctld: debug2: Tree head got back 0 looking for 1
slurmctld: debug3: Tree sending to control-machine
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused
slurmctld: debug3: connect refused, retrying
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused

slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
slurmctld: agent/is_node_resp: node:control-machine rpc:1001 : Communication connection failure
slurmctld: error: Nodes control-machine not responding

slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused
^Cslurmctld: Terminate signal (SIGINT or SIGTERM) received
slurmctld: debug: sched: slurmctld terminating
slurmctld: debug3: _slurmctld_rpc_mgr shutting down
slurmctld: Saving all slurm state
slurmctld: error: Could not open job state file /tmp/job_state: No such file or directory
slurmctld: error: NOTE: Trying backup state save file. Jobs may be lost!
slurmctld: No job state file (/tmp/job_state.old) found
slurmctld: debug3: Writing job id 1 to header record of job_state file
slurmctld: debug4: unable to create link for /tmp/job_state -> /tmp/job_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/node_state -> /tmp/node_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/part_state -> /tmp/part_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/resv_state -> /tmp/resv_state.old: No such file or directory
slurmctld: debug2: _slurm_connect failed: Connection refused
slurmctld: debug2: Error connecting slurm stream socket at 10.47.65.195:6818: Connection refused
slurmctld: debug3: problems with control-machine
slurmctld: debug2: Tree head got back 1
slurmctld: debug4: unable to create link for /tmp/trigger_state -> /tmp/trigger_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_mgr_state -> /tmp/assoc_mgr_state.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/assoc_usage -> /tmp/assoc_usage.old: No such file or directory
slurmctld: debug4: unable to create link for /tmp/qos_usage -> /tmp/qos_usage.old: No such file or directory
slurmctld: debug3: _slurmctld_background shutting down
slurmctld: Unable to remove pidfile '/var/run/slurmctld.pid': Permission denied

I am trying to run this on a single machine.
Any suggestions?

Thanks!
-Monica
Uwe Sauter
2014-10-09 19:09:41 UTC
Did you check your firewall rules? As you can see, you need to allow connections on port 6818...
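
A quick sanity check (just a sketch; the exact tools depend on your distribution) is to look at what is listening on the slurmd port and whether the firewall touches it:

ss -tlnp | grep 6818               # is anything listening on the slurmd port?
iptables -L INPUT -n | grep 6818   # is there a rule affecting it?

If nothing shows up as listening at all, the "Connection refused" would simply mean slurmd is not running on that address rather than being blocked.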
Uwe Sauter
2014-10-09 19:11:40 UTC
And port 6817 as well
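
If the firewall really is the culprit, opening both ports could look roughly like this (firewalld syntax shown as one possibility; adjust to whatever your distribution uses):

firewall-cmd --permanent --add-port=6817/tcp   # slurmctld
firewall-cmd --permanent --add-port=6818/tcp   # slurmd
firewall-cmd --reload

or, with plain iptables:

iptables -A INPUT -p tcp --dport 6817 -j ACCEPT
iptables -A INPUT -p tcp --dport 6818 -j ACCEPT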
Chrysovalantis Paschoulas
2014-10-10 08:02:04 UTC
Hello!

I would suggest the following steps:

1) Configure a working MailProg on your server so that you can get email
notifications. (optional)

2) Change your "StateSaveLocation" parameter; don't use /tmp. Create a
dedicated directory and give write permissions only to the SlurmUser (the
user who runs the slurmctld daemon). See the sketch after this list.

3) Use both the ControlMachine and ControlAddr options. Set
ControlMachine=server-hostname and ControlAddr=server-ip, using a hostname
and IP that are visible/accessible on the same network as the compute nodes.

4) You could try setting ProctrackType=proctrack/linuxproc.

5) The directories for the pid and log files should be writable by the
SlurmUser.

6) Before trying to use slurmdbd, at least set up the jobcomp mechanism
so that you have a history of old jobs. Just set
JobCompType="jobcomp/filetxt" and JobCompLoc="<dir>", where JobCompLoc
should be writable by the SlurmUser.

7) You could also set JobAcctGatherFrequency=0.

8) You should also check your firewall.

9) I see that you are trying to run slurmctld and slurmd on the same
machine. I have never tried this, but could you tell us whether you can see
both processes running? When you run "service slurm start", which
processes are started? In the end I would suggest using a different
machine as a compute node. If you use VMs, it shouldn't be difficult
to set up a second VM ;)
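
For steps 2, 3, 5, 6 and 9 above, here is a minimal sketch of what that could look like on a single machine. The directory paths below are only examples you should adapt; the IP is the one that appears in your log, assuming that really is this machine's address:

# run as root: create state/log directories writable only by the slurm user
mkdir -p /var/spool/slurmctld /var/log/slurm
chown slurm:slurm /var/spool/slurmctld /var/log/slurm
chmod 0755 /var/spool/slurmctld /var/log/slurm

# afterwards, check that both daemons are really up
pgrep -l slurmctld
pgrep -l slurmd

And the corresponding changes in slurm.conf:

ControlMachine=control-machine
ControlAddr=10.47.65.195
StateSaveLocation=/var/spool/slurmctld
SlurmctldLogFile=/var/log/slurm/slurmctld.log
SlurmdLogFile=/var/log/slurm/slurmd.log
ProctrackType=proctrack/linuxproc
JobCompType=jobcomp/filetxt
JobCompLoc=/var/log/slurm/jobcomp.log
JobAcctGatherFrequency=0

That should at least silence the world-writable StateSaveLocation warning. For the pidfile error, either start the daemons with enough privileges to write under /var/run, or point SlurmctldPidFile/SlurmdPidFile at a directory the slurm user can write.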

I hope this will help you!

Best Regards,
Chrysovalantis Paschoulas