Discussion:
Any plans to fix sbatch -N -c bad constraints?
Franco Broi
2014-06-17 00:48:24 UTC
You can't currently submit a job with a node range -N min-max (min < max)
and -c < all cpus; you get a bad constraints error.

A few people have reported this bug over the past several months, but I
haven't seen any mention of a fix.

Cheers,
Franco Broi
2014-07-16 10:08:31 UTC
Hi

Been looking into this a bit more and it seems that part of the problem
is in sbatch where it modifies the ntasks value.

src/sbatch/opt.c, line 2406


    /* massage the numbers */
    if ((opt.nodes_set || opt.extra_set) &&
        ((opt.min_nodes == opt.max_nodes) || (opt.max_nodes == 0)) &&
        !opt.ntasks_set) {
        /* 1 proc / node default */
        opt.ntasks = MAX(opt.min_nodes, 1);

If I remove the check for opt.min_nodes == opt.max_nodes, my job works.

I also made a change in src/slurmctld/node_scheduler.c at line 846 to
set req_nodes to max_nodes instead of min_nodes, but I'm not sure that
does anything; it just looked wrong. I'll change it back tomorrow and
see if my job still works.

This is the command that would normally fail but now works. d1 has 16
nodes, each with 16 cores, and I'm using cons_res with CR_CPU.

sbatch -p d1 -N15-16 -c 4

But any value of min_cpu <= num_cpus only allocates 4 nodes, while -N5-16
gives me 16 nodes. Weird!

Cheers,
Franco Broi
2014-07-21 07:03:33 UTC
This patch allows us to submit jobs with min_nodes < max_nodes and
num_cpus < max_cpus but it breaks down when num_nodes <= num_cpus.

Partition d1 has 16 nodes, each with 16 cpus, and we are using
SelectType=select/cons_res
SelectTypeParameters=CR_CPU


sbatch -p d1 -N15-16 -c 4

The above allocates 16 nodes when available but

sbatch -p d1 -N4-16 -c 4

only allocates 4 nodes even if more are available.


--- slurm-14.03.6/src/sbatch/opt.c 2014-07-17 06:48:18.000000000 +0800
+++ slurm-14.03.6.new/src/sbatch/opt.c 2014-07-17 08:16:39.000000000 +0800
@@ -2403,9 +2403,7 @@
     }

     /* massage the numbers */
-    if ((opt.nodes_set || opt.extra_set) &&
-        ((opt.min_nodes == opt.max_nodes) || (opt.max_nodes == 0)) &&
-        !opt.ntasks_set) {
+    if (!opt.ntasks_set && (opt.nodes_set || opt.extra_set)) {
         /* 1 proc / node default */
         opt.ntasks = MAX(opt.min_nodes, 1);

diff -Nur -x .deps -x Makefile -x .libs slurm-14.03.6/src/slurmctld/node_scheduler.c slurm-14.03.6.new/src/slurmctld/node_scheduler.c
--- slurm-14.03.6/src/slurmctld/node_scheduler.c 2014-07-17 06:48:18.000000000 +0800
+++ slurm-14.03.6.new/src/slurmctld/node_scheduler.c 2014-07-17 08:11:06.000000000 +0800
@@ -843,7 +843,7 @@
}
feature_bitmap = NULL;
min_nodes = feat_ptr->count;
- req_nodes = feat_ptr->count;
+ req_nodes = MAX(feat_ptr->count, max_nodes);
job_ptr->details->min_nodes = feat_ptr->count;
job_ptr->details->min_cpus = feat_ptr->count;
if (*preemptee_job_list) {