Paul Mezzanini
2012-03-23 16:04:02 UTC
Are there known issues with large dependency lists? We have a user who is
running a fairly large number of generational jobs. The basic shape of her
workflow is that she spawns N workers, all of which must complete before
the next generation of N workers can start.
Her current set uses 16 workers per generation; I have no idea how many
generations in total. She can submit up to around generation 7 before
things really go south. We start to see the effects around generation 4
(submits slow down slightly), and the moment generation 7 begins
submitting, the speed drops off sharply. slurmctld's CPU usage goes to
100% and I start getting processing-time warnings in the slurmctld logs
(slurmctld: Warning: Note very large processing time from
_slurm_rpc_submit_batch_job: usec=2735283). Turning the verbosity up
revealed nothing obvious. Eventually sbatch fails with timeouts, which
kills the rest of the submits.
As a test, we slowed her submit script down with a few sleep calls to see
whether we were simply overwhelming slurmctld. The same slowdown still
occurred at generation 7.
I have created a very simplified version of her submit scripts for
testing. It shows the same issues.
Important info:
Slurm 2.3.1
Controller is a KVM VM with 2 processors (AMD, 2.8 GHz) and 14 GB of RAM
No memory or disk limits appear to be the issue
Generation G's jobs list only generation G-1's jobs as dependencies
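To be concrete about what that dependency list looks like: every job in
generation G carries all 16 job IDs from generation G-1 in a single afterok
clause, so the generated submit lines end up looking roughly like this (job
IDs invented, list trimmed to four workers to keep it readable):

    sbatch --qos=rc-normal -o /dev/null \
        --dependency=afterok:10001:10002:10003:10004 \
        -J dep-2-1 slurm-payload.sh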
My submit scripts for testing:
####BEGIN CONSOLE DUMP####
[***@tropos submitotron []]# cat submit-many-jobs.sh
#!/bin/bash
# Just a constant variable used throughout the script to name our jobs
# in a meaningful way.
BASEJOBNAME="dep"
# Another constant variable used to name the slurm submission file that
# this script is going to submit to slurm.
JOBFILE="slurm-payload.sh"
# Generations requested
NUMBEROFGENERATIONS=16
# Workers per generation
NUMBEROFWORKERS=16
# The first generation has no dependency so it has its own loop.
#
# We capture the job number slurm spits out and put it into an array with
# the index being the generation.
# Future jobs can then reference $GENERATION - 1 to set the dependency.
for GENERATION in $(seq 1 ${NUMBEROFGENERATIONS}) ; do
    if [ ${GENERATION} -eq 1 ] ; then
        for WORKER in $(seq 1 ${NUMBEROFWORKERS}) ; do
            echo GENERATION/WORKER: ${GENERATION}/${WORKER}
            WORKERLIST[${GENERATION}]=$(sbatch --qos=rc-normal -o /dev/null \
                -J ${BASEJOBNAME}-${GENERATION}-${WORKER} ${JOBFILE} \
                | awk '{ print $4 }'):${WORKERLIST[${GENERATION}]}
        done
    else
        for WORKER in $(seq 1 ${NUMBEROFWORKERS}) ; do
            echo GENERATION/WORKER: ${GENERATION}/${WORKER}
            WORKERLIST[${GENERATION}]=$(sbatch --qos=rc-normal -o /dev/null \
                --dependency=afterok:${WORKERLIST[$(expr ${GENERATION} - 1)]%\:} \
                -J ${BASEJOBNAME}-${GENERATION}-${WORKER} ${JOBFILE} \
                | awk '{ print $4 }'):${WORKERLIST[${GENERATION}]}
        done
    fi
done
[***@tropos submitotron []]# cat slurm-payload.sh
#!/bin/bash -l
# NOTE the -l flag!
#
# Where to send mail...
#SBATCH --mail-user pfmeec-***@public.gmane.org
# notify on state change: BEGIN, END, FAIL or ALL
#SBATCH --mail-type=FAIL
# Requested max run time H:M:S; anything over will be KILLED
#SBATCH -t 0:1:30
# Valid partitions are "work" and "debug"
#SBATCH -p work -n 1
# Job memory requirements in MB
#SBATCH --mem=30
#Just a quick sleep.
sleep 60
[***@tropos submitotron []]#
####END CONSOLE DUMP####
In case the mail client mangles the indentation, there is a GitHub copy here:
https://github.com/paulmezz/SlurmThings
I know there are ways I could clean up the loops but for this test I just
don't care :)
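For reference, the two branches could be collapsed into a single loop along
these lines (untested sketch, intended to behave the same as the version
above):

    for GENERATION in $(seq 1 ${NUMBEROFGENERATIONS}) ; do
        # Only generations after the first carry a dependency.
        DEPFLAG=""
        if [ ${GENERATION} -gt 1 ] ; then
            DEPFLAG="--dependency=afterok:${WORKERLIST[$((GENERATION - 1))]%:}"
        fi
        for WORKER in $(seq 1 ${NUMBEROFWORKERS}) ; do
            echo GENERATION/WORKER: ${GENERATION}/${WORKER}
            WORKERLIST[${GENERATION}]=$(sbatch --qos=rc-normal -o /dev/null ${DEPFLAG} \
                -J ${BASEJOBNAME}-${GENERATION}-${WORKER} ${JOBFILE} \
                | awk '{ print $4 }'):${WORKERLIST[${GENERATION}]}
        done
    done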
Any ideas? (and thanks!)
-paul
--
Paul.Mezzanini-***@public.gmane.org
Sr Systems Administrator/Engineer
Research Computing at RIT
585.475.3245