Discussion:
slurmctld: Recheck if memory is reserved resource
Dorian Krause
2014-10-10 12:33:34 UTC
This commit fixes a bug we observed when combining select/linear with
gres. If an allocation was requested with a --gres argument, an srun
execution within that allocation would stall indefinitely:

-bash-4.1$ salloc -N 1 --gres=gpfs:100
salloc: Granted job allocation 384049
bash-4.1$ srun -w j3c017 -n 1 hostname
srun: Job step creation temporarily disabled, retrying

The slurmctld log showed:

debug3: StepDesc: user_id=10034 job_id=384049 node_count=1-1 cpu_count=1
debug3: cpu_freq=4294967294 num_tasks=1 relative=65534 task_dist=1 node_list=j3c017
debug3: host=j3l02 port=33608 name=hostname network=(null) exclusive=0
debug3: checkpoint-dir=/home/user checkpoint_int=0
debug3: mem_per_node=62720 resv_port_cnt=65534 immediate=0 no_kill=0
debug3: overcommit=0 time_limit=0 gres=(null) constraints=(null)
debug: Configuration for job 384049 complete
_pick_step_nodes: some requested nodes j3c017 still have memory used by other steps
_slurm_rpc_job_step_create for job 384049: Requested nodes are busy

If srun --exclusive had been used instead, everything would have worked fine.
The reason is that in exclusive mode the code properly checks whether memory
is a reserved resource in the _pick_step_nodes() function.
This commit modifies the alternate (non-exclusive) code path to do the same.
---
src/slurmctld/step_mgr.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/src/slurmctld/step_mgr.c b/src/slurmctld/step_mgr.c
index b62a06d..7cce3e8 100644
--- a/src/slurmctld/step_mgr.c
+++ b/src/slurmctld/step_mgr.c
@@ -1174,7 +1174,8 @@ _pick_step_nodes (struct job_record *job_ptr,

total_cpus = job_resrcs_ptr->cpus[node_inx];
usable_cpu_cnt[i] = avail_cpus = total_cpus;
- if (step_spec->pn_min_memory & MEM_PER_CPU) {
+ if (_is_mem_resv() &&
+ step_spec->pn_min_memory & MEM_PER_CPU) {
uint32_t mem_use = step_spec->pn_min_memory;
mem_use &= (~MEM_PER_CPU);
/* ignore current step allocations */
@@ -1191,7 +1192,7 @@ _pick_step_nodes (struct job_record *job_ptr,
usable_cpu_cnt[i] = avail_cpus;
fail_mode = ESLURM_INVALID_TASK_MEMORY;
}
- } else if (step_spec->pn_min_memory) {
+ } else if (_is_mem_resv() && step_spec->pn_min_memory) {
uint32_t mem_use = step_spec->pn_min_memory;
/* ignore current step allocations */
tmp_mem = job_resrcs_ptr->
-- 1.9.3
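
For readers without the surrounding source at hand, the effect of the added guard can be sketched as a small standalone model. This is only an illustration, not the slurmctld code: every name prefixed with sketch_ is hypothetical, the flag value 0x80000000 for the per-CPU bit is an assumption, and so is the idea that the job's recorded memory on the node is 0 when memory is not a reserved resource.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption: high bit of the 32-bit field marks a per-CPU memory limit. */
#define SKETCH_MEM_PER_CPU 0x80000000u

/* Hypothetical stand-in for the configuration check done by _is_mem_resv(). */
static bool sketch_mem_is_reserved = false;

static bool sketch_is_mem_resv(void)
{
	return sketch_mem_is_reserved;
}

/*
 * Simplified model of the memory test on one node: return how many of the
 * job's CPUs on that node remain usable for a new step.
 *   pn_min_memory   - step memory request in MB, optionally flagged per-CPU
 *   total_cpus      - CPUs the job holds on the node
 *   job_mem_on_node - memory recorded for the job on the node, in MB
 */
static uint32_t sketch_usable_cpus(uint32_t pn_min_memory,
				   uint32_t total_cpus,
				   uint32_t job_mem_on_node)
{
	uint32_t avail_cpus = total_cpus;

	/* The guard added by the patch: skip the test if memory is not reserved. */
	if (!sketch_is_mem_resv())
		return avail_cpus;

	if (pn_min_memory & SKETCH_MEM_PER_CPU) {
		/* Strip the flag bit to get the per-CPU amount. */
		uint32_t mem_use = pn_min_memory & ~SKETCH_MEM_PER_CPU;
		if (mem_use != 0) {
			uint32_t fit = job_mem_on_node / mem_use;
			if (fit < avail_cpus)
				avail_cpus = fit;
		}
	} else if (pn_min_memory > job_mem_on_node) {
		/* Per-node request larger than the recorded memory: node unusable. */
		avail_cpus = 0;
	}
	return avail_cpus;
}
```

In this model, a step with mem_per_node=62720 and a recorded job memory of 0 gets no usable CPUs whenever the memory test runs, which would match the "Requested nodes are busy" symptom. With the guard in place and memory not reserved, all CPUs stay usable.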



------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
Forschungszentrum Juelich GmbH
52425 Juelich
Sitz der Gesellschaft: Juelich
Eingetragen im Handelsregister des Amtsgerichts Dueren Nr. HR B 3498
Vorsitzender des Aufsichtsrats: MinDir Dr. Karl Eugen Huthmacher
Geschaeftsfuehrung: Prof. Dr.-Ing. Wolfgang Marquardt (Vorsitzender),
Karsten Beneke (stellv. Vorsitzender), Prof. Dr.-Ing. Harald Bolt,
Prof. Dr. Sebastian M. Schmidt
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------
j***@public.gmane.org
2014-10-10 17:29:43 UTC
The patch has been committed here:
https://github.com/SchedMD/slurm/commit/0dd124692890ddb187abac56f779770e12d38baa

Thanks!
--
Morris "Moe" Jette
CTO, SchedMD LLC