This patch allows us to submit jobs with min_nodes < max_nodes and
num_cpus < max_cpus, but it breaks down when num_nodes <= num_cpus.
Partition d1 has 16 nodes, each with 16 cpus, and we are using:
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
sbatch -p d1 -N15-16 -c 4
The above allocates 16 nodes when available but
sbatch -p d1 -N4-16 -c 4
only allocates 4 nodes even if more are available.
--- slurm-14.03.6/src/sbatch/opt.c 2014-07-17 06:48:18.000000000 +0800
+++ slurm-14.03.6.new/src/sbatch/opt.c 2014-07-17 08:16:39.000000000 +0800
@@ -2403,9 +2403,7 @@
}
/* massage the numbers */
- if ((opt.nodes_set || opt.extra_set) &&
- ((opt.min_nodes == opt.max_nodes) || (opt.max_nodes == 0)) &&
- !opt.ntasks_set) {
+ if (!opt.ntasks_set && (opt.nodes_set || opt.extra_set)) {
/* 1 proc / node default */
opt.ntasks = MAX(opt.min_nodes, 1);
diff -Nur -x .deps -x Makefile -x .libs slurm-14.03.6/src/slurmctld/node_scheduler.c slurm-14.03.6.new/src/slurmctld/node_scheduler.c
--- slurm-14.03.6/src/slurmctld/node_scheduler.c 2014-07-17 06:48:18.000000000 +0800
+++ slurm-14.03.6.new/src/slurmctld/node_scheduler.c 2014-07-17 08:11:06.000000000 +0800
@@ -843,7 +843,7 @@
}
feature_bitmap = NULL;
min_nodes = feat_ptr->count;
- req_nodes = feat_ptr->count;
+ req_nodes = MAX(feat_ptr->count, max_nodes);
job_ptr->details->min_nodes = feat_ptr->count;
job_ptr->details->min_cpus = feat_ptr->count;
if (*preemptee_job_list) {
Post by Franco Broi
Hi
I've been looking into this a bit more, and it seems that part of the
problem is in sbatch, where it modifies the ntasks value.
"src/sbatch/opt.c", line 2406:
/* massage the numbers */
if ((opt.nodes_set || opt.extra_set) &&
((opt.min_nodes == opt.max_nodes) || (opt.max_nodes == 0)) &&
!opt.ntasks_set) {
/* 1 proc / node default */
opt.ntasks = MAX(opt.min_nodes, 1);
If I remove the check for opt.min_nodes == opt.max_nodes, my job works.
I also made a change in src/slurmctld/node_scheduler.c at line 846 to
set req_nodes to max_nodes instead of min_nodes, but I'm not sure that
does anything; it just looked wrong. I'll change it back tomorrow and
see if my job still works.
This is the command that would normally fail but now works; d1 has 16
nodes, each with 16 cores, and I'm using cons_res with CR_CPU.
sbatch -p d1 -N15-16 -c 4
but any value of min_nodes <= num_cpus only allocates 4 nodes; -N5-16
gives me 16 nodes - weird!
Cheers,
Post by Franco Broi
You can't currently submit a job with -Nmin-max (min < max) and -c set
to less than all the cpus; you get a bad constraints error.
A few people have reported this bug over the past several months, but I
haven't seen any mention of a fix.
Cheers,