Discussion:
Slurm, RHEL6, cgroups and not constraining memory
Christopher Samuel
2013-01-18 06:05:02 UTC
Hi folks,

I'm playing with Slurm 2.5.1 on a RHEL6.3 box and trying to get it to
limit the amount of memory a job can use. To demonstrate the issue I've
got a program that just loops allocating RAM in 1 GB chunks.
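
(I haven't pasted the real memtest here, but a minimal stand-in along
these lines, which never writes to the memory it allocates, shows the
same behaviour.)

/* memtest.c (sketch): loop allocating 1 GiB chunks until malloc() fails.
 * The memory is never touched, so it never becomes resident. */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    int i;
    for (i = 1; ; i++) {
        if (malloc(1UL << 30) == NULL) {    /* request another 1 GiB */
            perror("Malloc failed");        /* "Cannot allocate memory" */
            return 1;
        }
        printf("Allocated %d GB\n", i);
    }
    return 0;
}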

My /etc/slurm/cgroup.conf has:

CgroupAutomount=yes
CgroupReleaseAgentDir="/etc/slurm/cgroup"

ConstrainCores=yes
ConstrainRAMSpace=yes
ConstrainSwapSpace=yes

Slurm has all the cgroup plugins enabled:

ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
JobAcctGatherType=jobacct_gather/cgroup

I've also confirmed that cgroups are being created and destroyed for
these jobs too.

By default (with overcommit disabled) I get:

[***@qan02 Run1]# srun -n1 --mem=4G ./memtest
Malloc failed: Cannot allocate memory
Allocated 1 GB
Allocated 2 GB
Allocated 3 GB
Allocated 4 GB
Allocated 5 GB
Allocated 6 GB
Allocated 7 GB
Allocated 8 GB
Allocated 9 GB
Allocated 10 GB
Allocated 11 GB
Allocated 12 GB
Allocated 13 GB
Allocated 14 GB
Allocated 15 GB
Allocated 16 GB
Allocated 17 GB
Allocated 18 GB
Allocated 19 GB
Allocated 20 GB
Allocated 21 GB
Allocated 22 GB
Allocated 23 GB
Allocated 24 GB
Allocated 25 GB
Allocated 26 GB
Allocated 27 GB
Allocated 28 GB
Allocated 29 GB
Allocated 30 GB
Allocated 31 GB
Allocated 32 GB
Allocated 33 GB
Allocated 34 GB
Allocated 35 GB
srun: error: qan02: task 0: Exited with exit code 1


Which shows that the job can allocate as much as the machine has, despite
Slurm being told it only wants 4 GB of RAM.

Poking at the cgroup when I make the program sleep between allocations, I
see that the memory cgroup doesn't appear to have any limits set on its
usage:

memory.limit_in_bytes
9223372036854775807

memory.memsw.limit_in_bytes
9223372036854775807

memory.soft_limit_in_bytes
9223372036854775807

Which would explain a lot.

Any ideas?

cheers!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Bjørn-Helge Mevik
2013-01-18 08:37:03 UTC
Post by Christopher Samuel
Which shows that it can request as much as the machine has, despite
Slurm being told it only wants 4GB of RAM.
I don't know if this is the reason in your case, but note that cgroup
in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.

Try filling the allocated memory with some values, and you will probably
see that after filling 4 GiB, the job is killed.
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
Christopher Samuel
2013-01-18 22:32:03 UTC
Hiya!
Post by Bjørn-Helge Mevik
I don't know if this is the reason in your case, but note that cgroup
in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
Hmm, as a sysadmin that doesn't seem very useful; you want it to
constrain how much memory the application can allocate so that it can
learn it has hit a limit when malloc() fails (and hopefully gracefully
report/recover).

We do this in our (relatively old) cpuset-based Torque install by making
it set RLIMIT_AS instead of RLIMIT_DATA to enforce memory requests, since
current implementations of malloc() in glibc use mmap() rather than
brk() for any non-trivial allocation, and mmap() only honours RLIMIT_AS,
not RLIMIT_DATA.
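
(To make that concrete, here's a self-contained sketch, not our actual
Torque patch, of why RLIMIT_AS catches a big malloc() while RLIMIT_DATA
doesn't.)

/* Sketch: cap the address space at 1 GiB, then try a 2 GiB malloc().
 * Because glibc services the request with mmap(), it fails with ENOMEM;
 * an equivalent RLIMIT_DATA setting would not stop it. */
#include <errno.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl = { 1UL << 30, 1UL << 30 };   /* soft and hard: 1 GiB */

    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    errno = 0;
    if (malloc(2UL << 30) == NULL)                 /* 2 GiB request */
        printf("malloc failed as expected: %s\n", strerror(errno));
    else
        printf("malloc unexpectedly succeeded\n");
    return 0;
}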

That's not perfect though as the user could launch multiple processes,
each of which can allocate up to RLIMIT_AS. Hence our interest in
cgroups and their ability to set a limit on an entire job.
Post by Bjørn-Helge Mevik
Try filling the allocated memory with some values, and you will probably
see that after filling 4 GiB, the job is killed.
But we don't want the job to be killed; we want it to find out that it's
hit its memory limit. An application should only be able to allocate
the amount of memory the batch job has requested.

cheers!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Bjørn-Helge Mevik
2013-01-21 09:20:15 UTC
Post by Christopher Samuel
Post by Bjørn-Helge Mevik
I don't know if this is the reason in your case, but note that cgroup
in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
Hmm, as a sysadmin that doesn't seem very useful,
Hmm, as a sysadmin I must say that I disagree. :)
Post by Christopher Samuel
you want it to constrain how much memory the application can allocate
so that it can learn it has hit a limit when malloc() fails (and
hopefully gracefully report/recover).
What the best way to constrain memory is depends very much on how
the cluster is set up and what types of jobs run on it, IMO.

A problem with limiting virtual memory allocations is that with
recent versions of glibc, the amount of VMEM that a threaded application
allocates is much, much bigger than what it is ever going to use. For
instance, on our master node, slurmctld uses about 50 MiB of RAM
(resident), but the VMEM usage reported by ps or top is 16 GiB(!). This
is the reason we switched to using cgroups.

As for letting cgroups notify the job instead of killing it, that is
probably hard to implement, because the cgroups limiting is done by the
kernel itself, not slurm, and I at least don't know of any
callback-hooks or other features in cgroups that could be used for such a
thing.
--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
David Bigagli
2013-01-21 14:19:03 UTC
Perhaps an easy approach is to set RLIMIT_AS in the job itself or in its
wrapper, then allow the application to handle the ENOMEM error.
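
Something along these lines, as a sketch only, with a hard-coded 4 GiB
cap purely for illustration (a real wrapper would take the figure from
the job's memory request):

/* limit-as.c (sketch): cap RLIMIT_AS, then exec the real job command.
 * The limit is inherited across exec(), so the application gets ENOMEM
 * from malloc() once it tries to exceed the cap and can handle it. */
#include <stdio.h>
#include <sys/resource.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 2;
    }

    struct rlimit rl = { (rlim_t)4 << 30, (rlim_t)4 << 30 };  /* 4 GiB, illustrative */
    if (setrlimit(RLIMIT_AS, &rl) != 0) {
        perror("setrlimit");
        return 1;
    }

    execvp(argv[1], &argv[1]);                     /* only returns on error */
    perror("execvp");
    return 1;
}
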
Post by Bjørn-Helge Mevik
Post by Christopher Samuel
Post by Bjørn-Helge Mevik
I don't know if this is the reason in your case, but note that cgroup
in slurm constrains _resident_ RAM, not _allocated_ ("virtual") RAM.
Hmm, as a sysadmin that doesn't seem very useful,
Hmm, as a sysadmin I must say that I disagree. :)
Post by Christopher Samuel
you want it to constrain how much memory the application can allocate
so that it can learn it has hit a limit when malloc() fails (and
hopefully gracefully report/recover).
What the best way to constrain memory is depends very much on how
the cluster is set up and what types of jobs run on it, IMO.
A problem with limiting virtual memory allocations is that with
recent versions of glibc, the amount of VMEM that a threaded application
allocates is much, much bigger than what it is ever going to use. For
instance, on our master node, slurmctld uses about 50 MiB of RAM
(resident), but the VMEM usage reported by ps or top is 16 GiB(!). This
is the reason we switched to using cgroups.
As for letting cgroups notify the job instead of killing it, that is
probably hard to implement, because the cgroups limiting is done by the
kernel itself, not slurm, and I at least don't know of any
callback-hooks or other features in cgroups that could be used for such a
thing.
--
/David
Christopher Samuel
2013-01-22 00:04:02 UTC
Post by David Bigagli
Perhaps an easy approach is to set RLIMIT_AS in the job itself or
in its wrapper, then allow the application to handle ENOMEM error.
This is what we do already in Torque (via a local patch), the only
wrinkle there being that a job script can launch N processes each of
which can allocate up to RLIMIT_AS.

We were hoping that Slurm's cgroups support would permit limiting the
memory allocated by the whole job.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Mark A. Grondona
2013-01-22 16:57:04 UTC
Post by David Bigagli
Perhaps an easy approach is to set RLIMIT_AS in the job itself or
in its wrapper, then allow the application to handle ENOMEM error.
Post by Christopher Samuel
This is what we do already in Torque (via a local patch), the only
wrinkle there being that a job script can launch N processes each of
which can allocate up to RLIMIT_AS.
We were hoping that Slurm's cgroups support would permit limiting the
memory allocated by the whole job.
If you want to limit allocations rather than resident memory, you
can disable overcommit (i.e. set /proc/sys/vm/overcommit_memory to 2),
but afaik in RHEL6 you have to do this for the system as a
whole. Disabling overcommit will make calls to malloc() fail immediately
if too much memory is already ~committed~ to other processes, so you
may end up having a lot less RAM than you expect depending on the
behavior of other processes on the system. If your nodes run a single
job then this is less of a problem.

mark
Christopher Samuel
2013-01-23 04:35:03 UTC
Post by Mark A. Grondona
If you want to limit allocations rather than resident memory, you
can disable overcommit (i.e. set /proc/sys/vm/overcommit_memory to 2),
We already do that to avoid having nodes hang due to the annoying OOM
killer lockups you can get. Haven't had a deadlock like that for
years now.
Post by Mark A. Grondona
but afaik in RHEL6 you have to do this for the system as a whole.
Yeah, that's a sysctl for the kernel; we run with:

# Stop nodes OOM'ing
vm.overcommit_memory = 2
vm.overcommit_ratio = 93
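
(For anyone following along: with overcommit_memory = 2 the kernel caps
total committed address space at roughly SwapTotal plus overcommit_ratio
percent of MemTotal, which is the CommitLimit line in /proc/meminfo. A
throwaway sketch of that arithmetic, purely illustrative:)

/* Sketch: compute the commit limit enforced under vm.overcommit_memory = 2,
 * i.e. SwapTotal + overcommit_ratio% of MemTotal. */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/meminfo", "r");
    char line[128];
    long long mem_kb = 0, swap_kb = 0;

    if (!f) {
        perror("fopen");
        return 1;
    }
    while (fgets(line, sizeof(line), f)) {
        sscanf(line, "MemTotal: %lld kB", &mem_kb);
        sscanf(line, "SwapTotal: %lld kB", &swap_kb);
    }
    fclose(f);

    int ratio = 93;                      /* vm.overcommit_ratio from above */
    printf("approx CommitLimit: %lld kB (MemTotal %lld kB, SwapTotal %lld kB)\n",
           swap_kb + mem_kb * ratio / 100, mem_kb, swap_kb);
    return 0;
}
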
Post by Mark A. Grondona
Disabling overcommit will make calls to malloc() fail immediately
if too much memory is already ~committed~ to other processes, so
you may end up having a lot less RAM than you expect depending on
the behavior of other processes on the system. If your nodes run a
single job then this is less of a problem.
What we want (and what we get as a per-process limit in Torque with
our RLIMIT_AS patch) is that if a user requests a job with 1 GB of RAM,
then any attempt by a program to malloc() more than that should cause
malloc() to fail.

I'm pretty sure that's meant to be possible via cgroups.

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Mark A. Grondona
2013-01-23 19:50:05 UTC
Post by Mark A. Grondona
If you want to limit allocations rather than resident memory, you
can disable overcommit (i.e. set /proc/sys/vm/overcommit_memory to 2),
Post by Christopher Samuel
We already do that to avoid having nodes hang due to the annoying OOM
killer lockups you can get. Haven't had a deadlock like that for
years now.
Post by Mark A. Grondona
but afaik in RHEL6 you have to do this for the system as a whole.
Post by Christopher Samuel
# Stop nodes OOM'ing
vm.overcommit_memory = 2
vm.overcommit_ratio = 93
Post by Mark A. Grondona
Disabling overcommit will make calls to malloc() fail immediately
if too much memory is already ~committed~ to other processes, so
you may end up having a lot less RAM than you expect depending on
the behavior of other processes on the system. If your nodes run a
single job then this is less of a problem.
Post by Christopher Samuel
What we want (and what we get as a per-process limit in Torque with
our RLIMIT_AS patch) is that if a user requests a job with 1 GB of RAM,
then any attempt by a program to malloc() more than that should cause
malloc() to fail.
Yes, but unfortunately it appears that the system overcommit setting
does not affect the functionality of the memory cgroups implementation,
which *only* limits resident memory. So you can think of the
current memory cgroup implementation as affecting rss only.
This is what most people are interested in, because on systems
these days you have effectively limitless address space.
Post by Christopher Samuel
I'm pretty sure that's meant to be possible via cgroups.
It looks like back in 2008[1] there was an attempt to create a
"memrlimit" cgroup controller that did what you want, but
for whatever reason it was apparently never merged into the
kernel. So far the developers have expressly sought to *only*
limit rss with a cgroup controller, afaics.

mark

[1] http://lwn.net/Articles/283287/
Christopher Samuel
2013-02-04 05:35:04 UTC
Post by Mark A. Grondona
Yes, but unfortunately it appears that system overcommit setting
does not affect the functionality of the memory cgroups
implementation, which *only* limits resident memory. So you can
think of the current memory cgroup implementation as affecting rss
only.
Apologies for the delay, been having some hardware fun recently.

You are indeed quite right, and there is no check in the mmap() or
sbrk() path to see whether the allocation would exceed the cgroup limit;
only resource limits are checked (RLIMIT_AS being the important one).

Sorry for the confusion!

Here's a query from the Open Grid Engine folks last June on the same
issue, being told that it's not implemented yet:

https://lkml.org/lkml/2012/6/12/54

Hey ho..

All the best!
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci

Mark A. Grondona
2013-01-22 18:11:05 UTC
Post by Bjørn-Helge Mevik
As for letting cgroups notify the job instead of killing it, that is
probably hard to implement, because the cgroups limiting is done by the
kernel itself, not slurm, and I at least don't know of any
callback-hooks or other features in cgroups that could be used for such a
thing.
In upstream kernels there is already a feature for setting up
memcg notifications, and the majority of this should be backported
to RHEL as of RHEL6.4.

http://www.mjmwired.net/kernel/Documentation/cgroups/memory.txt

See Section 9 in the link above. An application can register for
notification when a usage_in_bytes threshold is crossed using
the eventfd() mechanism. This can be useful for applications that
have the ability to take some action when memory is tight (e.g.
release some cache, shrink buffers, etc.).
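
Registering for such a notification looks roughly like the sketch below
(the cgroup path is made up, error handling is trimmed, and a real job
would use the memory cgroup directory Slurm created for it):

/* Sketch of the memcg threshold notification from Section 9 of memory.txt:
 * arm an eventfd that fires when usage_in_bytes crosses a threshold. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    /* Example path only; use the cgroup Slurm created for the job. */
    const char *cg = "/cgroup/memory/slurm/uid_1000/job_1234";
    char path[256], line[128];

    int efd = eventfd(0, 0);
    snprintf(path, sizeof(path), "%s/memory.usage_in_bytes", cg);
    int ufd = open(path, O_RDONLY);
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);
    if (efd < 0 || ufd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    /* Arm the event: "<eventfd> <fd of memory.usage_in_bytes> <threshold>". */
    long long threshold = 3LL << 30;   /* e.g. warn at 3 GiB of a 4 GiB limit */
    snprintf(line, sizeof(line), "%d %d %lld", efd, ufd, threshold);
    if (write(cfd, line, strlen(line)) < 0) {
        perror("write cgroup.event_control");
        return 1;
    }

    uint64_t count;
    if (read(efd, &count, sizeof(count)) == (ssize_t)sizeof(count))  /* blocks here */
        fprintf(stderr, "usage crossed %lld bytes: time to shrink caches\n",
                threshold);
    return 0;
}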

Note that RHEL6.4 should also have the feature described in Sec 10,
i.e. the ability to disable the oom killer for a memcg.

mark
Bjørn-Helge Mevik
2013-01-23 12:43:03 UTC
Post by Mark A. Grondona
In upstream kernels there is already a feature for setting up
memcg notifications, and the majority of this should be backported
to RHEL as of RHEL6.4.
This is very good news! It is always a nuisance when users' jobs get
killed and there is no message about why in slurm-nnn.out. Then we get
an RT ticket, and have to grep in /var/log/messages.
--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
Mark A. Grondona
2013-01-23 20:24:02 UTC
Post by Bjørn-Helge Mevik
Post by Mark A. Grondona
In upstream kernels there is already a feature for setting up
memcg notifications, and the majority of this should be backported
to RHEL as of RHEL6.4.
This is very good news! It is always a nuisance when users' jobs get
killed and there is no message about why in slurm-nnn.out. Then we get
an RT ticket, and have to grep in /var/log/messages.
We use the following SPANK/Lua plugin, which makes a best-effort attempt
to notify users when one of their tasks is killed by the OOM killer.

http://code.google.com/p/slurm-spank-plugins/source/browse/lua/oom-detect.lua

It just does a grep of dmesg output, so it isn't perfect. Things
will be much better with oom notifications.

mark