Uwe Sauter
2014-09-02 14:29:32 UTC
Hi all,
I'm a bit confused by the explanation of the "BatchStartTimeout" option.
It states:
"Specifies how long to wait after a batch job start request is issued
before we expect the batch job to be running on the compute node.
Depending upon how nodes are returned to service, this value may need to
be increased above its default value of 10 seconds."
It is unclear from which point in time this timeout is counted. Some
possibilities:
- when the batch job is submitted
- when SLURM executes the ResumeProgram command
- when the node's slurm daemon contacts the controller daemon
Can someone reword the explanation or give details about this option?
Are there recommendations, e.g. linked to ResumeTimeout?
Thanks,
Uwe