Discussion:
preemption and topology plugin
Marcin Sliwowski
2014-09-23 19:03:34 UTC
I'm running version 2.6.9 and wondering if the preemption algorithm
takes into account the topology, as defined in topology.conf, when it
selects which jobs to preempt to make room for a new higher-priority MPI
job.

Based on what I have seen, it appears that it doesn't.

The reason I ask is that we define our InfiniBand topology as 8
individual fabrics: we have 8 bladecenters, each with its own fabric,
and they are not interconnected. A single partition includes all 8
bladecenters, with 32 nodes per bladecenter.

Eventually enough jobs are preempted and the MPI job is scheduled into a
bladecenter, but it comes at the cost of many jobs. The main problem is
that it preempts jobs on bladecenters where the MPI job does not
ultimately land.

If it took our defined topology into consideration and focused on
preempting jobs that reside in a single bladecenter, it could make room
for the MPI job while preempting far fewer jobs.
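
To make the idea concrete, here is a rough sketch in Python of the kind of selection I have in mind. It is not Slurm code and does not reflect how the preempt or select plugins work internally. The function name, the data layout, and the node and job names are all made up for illustration, and it assumes a node only becomes usable once every preemptable job on it has been preempted. For each bladecenter it picks the least-loaded nodes and counts the jobs that would have to go, then keeps the bladecenter with the smallest count. It is a greedy heuristic, so it may not find the true minimum when jobs span several nodes.

```python
def plan_preemption(nodes_needed, chassis_nodes, jobs_on_node):
    """Pick one bladecenter and the jobs to preempt there.

    chassis_nodes: dict mapping bladecenter name -> list of node names
    jobs_on_node:  dict mapping node name -> set of preemptable job ids
                   (nodes missing from the dict are treated as idle)
    Returns (bladecenter, set_of_job_ids), or None if no single
    bladecenter has enough nodes for the request.
    """
    best = None
    for chassis, nodes in chassis_nodes.items():
        if len(nodes) < nodes_needed:
            continue  # the job cannot span fabrics, so skip small chassis
        # Prefer nodes with the fewest jobs to evict (idle nodes first).
        cheapest = sorted(
            nodes, key=lambda n: (len(jobs_on_node.get(n, ())), n)
        )[:nodes_needed]
        victims = set()
        for node in cheapest:
            victims |= set(jobs_on_node.get(node, ()))
        if best is None or len(victims) < len(best[1]):
            best = (chassis, victims)
    return best


# Hypothetical two-bladecenter example, two nodes each.
chassis_nodes = {
    "bc1": ["bc1-n1", "bc1-n2"],
    "bc2": ["bc2-n1", "bc2-n2"],
}
jobs_on_node = {
    "bc1-n1": {"a", "b"},
    "bc1-n2": {"c"},
    "bc2-n1": {"d"},
}
chassis, victims = plan_preemption(2, chassis_nodes, jobs_on_node)
print(chassis, sorted(victims))
```

In this toy case a 2-node request lands on bc2 and only job d is preempted, whereas preempting without regard to the fabric boundaries could also hit jobs a, b and c on bc1 for no benefit.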

We have been scratching our heads on this one for a while.

SelectType=select/cons_res
PreemptType=preempt/partition_prio
TopologyPlugin=topology/tree

Thanks
--
Marcin Sliwowski | ***@RENCI | 919-445-0479
Marcin Sliwowski
2014-09-29 14:06:32 UTC
Can I submit an RFE for the partition_prio preemption plugin?

Looking through the partition_prio plugin's source code, from what I can
gather, it does not appear to be topology-aware.

At least, not in the way that the consumable resources selection plugin
is. That plugin has comment blocks referring to selection based on
topology and topology state information.

Thanks