Discussion: Questions about the task/cgroup plugin
Bjørn-Helge Mevik
2012-09-04 10:42:05 UTC
We've switched to use the task/cgroup plugin to constrain the memory
usage on our cluster. (Slurm 2.4.1, Rocks 6.0 based on CentOS 6.2)

We have the following cgroup.conf:

-------------
###
### General settings
###

CgroupMountpoint=/dev/cgroup
CgroupAutomount=yes
#default: CgroupReleaseAgentDir=/etc/slurm/cgroup

###
### Task/cgroup plugin
###

#default: ConstrainCores=no
#default: TaskAffinity=no
ConstrainRAMSpace=yes
#default: ConstrainDevices=no
#default: AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
-------------

I'm quite new to cgroups, so please forgive me if these are silly
questions:

******
1) Does the ConstrainRAMSpace kill a process when the job uses too much
RAM (resident), or too much RAM (resident) + swap?

I did a test on a node with 64530 MB RAM (according to "free -m"). The
node is configured in slurm to have 64530 MB RAM, and I ran a job with
--ntasks=1 --mem-per-cpu=64530.

The job started a C program that allocated a 65536 MB vector, and then
started to fill it (i.e., actually use it). The program was killed by
the oom-killer on the node, and /var/log/messages contained the
following:

Sep 4 11:43:53 compute-1-1 kernel: Task in /slurm/uid_10231/job_344/step_4294967294 killed as a result of limit of /slurm/uid_10231/job_344/step_4294967294
Sep 4 11:43:53 compute-1-1 kernel: memory: usage 64447376kB, limit 66078720kB, failcnt 0
Sep 4 11:43:53 compute-1-1 kernel: memory+swap: usage 66078720kB, limit 66078720kB, failcnt 49

To me, this looks like the job used 62936.89 MB (resident) memory, which is
less than the limit (64530 MB), but it used 64530 MB memory + swap,
which equals the limit, so it was killed.

Am I correct in this interpretation?

(It is not a problem if this is correct; we just want to be sure what
actually happens. If cgroup only constrained resident memory, one would
think that this program would not be killed, because it would never be
able to get 64530 MB resident (experiments have shown that the limit is
about 62894 MB on these nodes).)
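
A minimal sketch of that kind of allocate-and-touch test (not the exact
program I ran, just an illustration of the idea) would look like:

-------------
/* memtest.c -- allocate a buffer of the given size in MB and touch
 * every page so the memory actually becomes resident.  With the
 * cgroup limit in place, this is what eventually triggers the
 * oom-killer. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <megabytes>\n", argv[0]);
        return 1;
    }

    size_t mb = (size_t) strtoull(argv[1], NULL, 10);
    size_t nbytes = mb * 1024 * 1024;

    char *buf = malloc(nbytes);
    if (!buf) {
        perror("malloc");
        return 1;
    }

    /* touch one byte per (assumed 4 kB) page */
    for (size_t i = 0; i < nbytes; i += 4096)
        buf[i] = 1;

    printf("touched %zu MB\n", mb);
    free(buf);
    return 0;
}
-------------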


*******
2) Is it possible to get slurm to write a message to the job's stderr
(i.e., slurm-xxx.out) when a process is killed due to a task/cgroup
limit?


*******
3) The oom-killer is very talkative: Killing the process above resulted
in about 200 lines in /var/log/messages. Is there a way to reduce the
"chatter" a bit (but not turn of loggin alltogether)?


(Any other comments and suggestions about the task/cgroup use are
also welcome!)
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
Mark A. Grondona
2012-09-04 22:02:04 UTC
Post by Bjørn-Helge Mevik
We've switched to use the task/cgroup plugin to constrain the memory
usage on our cluster. (Slurm 2.4.1, Rocks 6.0 based on CentOS 6.2)
-------------
###
### General settings
###
CgroupMountpoint=/dev/cgroup
CgroupAutomount=yes
#default: CgroupReleaseAgentDir=/etc/slurm/cgroup
###
### Task/cgroup plugin
###
#default: ConstrainCores=no
#default: TaskAffinity=no
ConstrainRAMSpace=yes
#default: ConstrainDevices=no
#default: AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf
-------------
I'm quite new to cgroups, so please forgive me if these are silly
******
1) Does the ConstrainRAMSpace kill a process when the job uses too much
RAM (resident), or too much RAM (resident) + swap?
I did a test on a node with 64530 MB RAM (according to "free -m"). The
node is configured in slurm to have 64530 MB RAM, and I ran a job with
--ntasks=1 --mem-per-cpu=64530.
The job started a C program that allocated a 65536 MB vector, and then
started to fill it (i.e., actually use it). The program was killed by
the oom-killer on the node, and /var/log/messages contained the
Sep 4 11:43:53 compute-1-1 kernel: Task in /slurm/uid_10231/job_344/step_4294967294 killed as a result of limit of /slurm/uid_10231/job_344/step_4294967294
Sep 4 11:43:53 compute-1-1 kernel: memory: usage 64447376kB, limit 66078720kB, failcnt 0
Sep 4 11:43:53 compute-1-1 kernel: memory+swap: usage 66078720kB, limit 66078720kB, failcnt 49
To me, this looks like the job used 62936.89 MB (resident) memory, which is
less than the limit (64530 MB), but it used 64530 MB memory + swap,
which equals the limit, so it was killed.
Am I correct in this interpretation?
(It is not a problem if this is correct; we just want to be sure what
actually happens. If cgroup only constrained resident memory, one would
think that this program would not be killed, because it would never be
able to get 64530 MB resident (experiments have shown that the limit is
about 62894 MB on these nodes).)
There will be detailed documentation regarding memory cgroups in
the Documentation for your kernel (or, the latest documentation is
here

http://www.kernel.org/doc/Documentation/cgroups/memory.txt

if you are using a recent kernel). I'm sure you have also read through
the cgroup.conf(5) manpage in SLURM.

However, I am actually wondering if what you are seeing isn't a little
bug in the slurm task/cgroup plugin.

SLURM sets memory.limit_in_bytes to the allocated memory for
the job step * (AllowedRAMSpace/100). In your case it looks like
you are using the default AllowedRAMSpace, which is 100, so
memory.limit_in_bytes = 64530 MB. So it does appear that the limit
was set correctly. However, SLURM should only set
memory.memsw.limit_in_bytes if you have ConstrainSwapSpace=yes,
which you do not appear to have done. Therefore, the memory
cgroup should not have killed your job until the actual resident
memory usage went above the limit.
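
As a sanity check on the numbers: 64530 MB * (100/100) = 64530 MB, and
64530 * 1024 = 66078720 kB, which matches the "limit 66078720kB" the
kernel reported for both the memory and memory+swap lines in your log.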

In the slurmd log for the job there should be a line of output
which details the settings that slurm is applying to the job step
memory cgroup. It should look something like:

task/cgroup: /cgroup/memory/slurm...: alloc=xxMB...

You could check that line to verify things are set as you expect.
Post by Bjørn-Helge Mevik
*******
2) Is it possible to get slurm to write a message to the job's stderr
(i.e., slurm-xxx.out) when a process is killed due to a task/cgroup
limit?
I have a spank plugin that essentially greps the dmesg output
after job completion and issues such a message to the stderr of
the job if a task has been terminated by the OOM killer. It is
not perfect, but works 90% of the time. I can send it to you if
you like.
Post by Bjørn-Helge Mevik
*******
3) The oom-killer is very talkative: Killing the process above resulted
in about 200 lines in /var/log/messages. Is there a way to reduce the
"chatter" a bit (but not turn of loggin alltogether)?
This is the default output for the oom-killer. I'm not sure if there
is a way to quiet it down, but keep in mind that this output is very
handy to have around when something unexpected happens and you have
to go back and figure out exactly what triggered the oom-killer, and
why it decided to choose such-and-such a process.

However, in recent kernels (and as of RHEL6.3) there is now
the ability to disable the oom-killer on a per-cgroup basis. Instead
of having the oom killer go off when the cgroup memory limit is
exceeded, the kernel will stop all processes in the cgroup, and
signal another process waiting on an eventfd (obviously not in
the cgroup). This could be used by slurm or a slurm plugin to
gather memory usage data from the cgroup, terminate all processes
in the job, and write a cogent message to stderr for the user.

I haven't experimented with this interface yet.
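
For reference, a minimal sketch of that notification mechanism (cgroup v1,
as described in the memory.txt document linked above; the cgroup path is
just an example taken from the log earlier in this thread, and most error
handling is omitted):

-------------
/* oom_wait.c -- wait for an OOM event in a memory cgroup via eventfd. */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    const char *cg =
        "/dev/cgroup/memory/slurm/uid_10231/job_344/step_4294967294";
    char path[512], buf[64];

    /* eventfd that the kernel will signal on an OOM event */
    int efd = eventfd(0, 0);

    /* the control file we want to be notified about; writing "1" to it
     * would additionally disable the in-kernel oom-killer, freezing the
     * tasks instead of killing them */
    snprintf(path, sizeof(path), "%s/memory.oom_control", cg);
    int ofd = open(path, O_RDONLY);

    /* register the (eventfd, oom_control) pair with the cgroup */
    snprintf(path, sizeof(path), "%s/cgroup.event_control", cg);
    int cfd = open(path, O_WRONLY);
    if (efd < 0 || ofd < 0 || cfd < 0) {
        perror("setup");
        return 1;
    }

    snprintf(buf, sizeof(buf), "%d %d", efd, ofd);
    if (write(cfd, buf, strlen(buf)) < 0) {
        perror("cgroup.event_control");
        return 1;
    }

    /* blocks until the cgroup hits its limit */
    uint64_t count;
    if (read(efd, &count, sizeof(count)) == sizeof(count))
        printf("OOM event in %s\n", cg);

    return 0;
}
-------------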

mark
Bjørn-Helge Mevik
2012-09-05 08:58:03 UTC
Post by Mark A. Grondona
There will be detailed documentation regarding memory cgroups in
the Documentation for your kernel (or, the latest documentation is
here
http://www.kernel.org/doc/Documentation/cgroups/memory.txt
Thanks! I'm going to go through that.
Post by Mark A. Grondona
I'm sure you have also read through the cgroup.conf(5) manpage in
SLURM.
Yes. :) Even after reading it, I still wasn't certain what the
different limits actually constrain and how the Allowed* and
Constrain* options interact.
Post by Mark A. Grondona
SLURM sets memory.limit_in_bytes to the allocated memory for
[...]

Thanks for that description. That made it much clearer to me how
task/cgroup works (or should work. :).
Post by Mark A. Grondona
In the slurmd log for the job there should be a line of output
which details the settings that slurm is applying to the job step
memory cgroup.
For my job, it says:

task/cgroup/memory: total:64530M allowed:100%, swap:0%, max:100%(64530M) max+swap:100%(129060M) min:30M

task/cgroup: /slurm/uid_10231/job_344: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB
task/cgroup: /slurm/uid_10231/job_344/step_4294967294: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB

Which I guess means that both memory.limit_in_bytes and
memory.memsw.limit_in_bytes are set to 64530MB.
Post by Mark A. Grondona
I have a spank plugin that essentially greps the dmesg output
after job completion and issues such a message to the stderr of
the job if a task has been terminated by the OOM killer. It is
not perfect, but works 90% of the time. I can send it to you if
you like.
Yes, I'd very much like that! Jobs killed by the memory limit are quite
common on our cluster, and users get confused if there is no message
telling them why the job died.


Thanks for a very informative answer!
--
Regards,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
Mark A. Grondona
2012-09-05 17:12:04 UTC
Post by Bjørn-Helge Mevik
Post by Mark A. Grondona
There will be detailed documentation regarding memory cgroups in
the Documentation for your kernel (or, the latest documentation is
here
http://www.kernel.org/doc/Documentation/cgroups/memory.txt
Thanks! I'm going to go through that.
Post by Mark A. Grondona
I'm sure you have also read through the cgroup.conf(5) manpage in
SLURM.
Yes. :) Even after reading it, I still wasn't certain what the
different limits actually constrain and how the Allowed* and
Constrain* options interact.
Post by Mark A. Grondona
SLURM sets memory.limit_in_bytes to the allocated memory for
[...]
Thanks for that description. That made it much clearer to me how
task/cgroup works (or should work. :).
Post by Mark A. Grondona
In the slurmd log for the job there should be a line of output
which details the settings that slurm is applying to the job step
memory cgroup.
task/cgroup/memory: total:64530M allowed:100%, swap:0%, max:100%(64530M) max+swap:100%(129060M) min:30M
task/cgroup: /slurm/uid_10231/job_344: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB
task/cgroup: /slurm/uid_10231/job_344/step_4294967294: alloc=64530MB mem.limit=64530MB memsw.limit=64530MB
Which I guess means that both memory.limit_in_bytes and
memory.memsw.limit_in_bytes are set to 64530MB.
Hm, that doesn't seem to agree with the documentation, so I would
assume that is a bug. Either the docs or the code should be updated.
Anyhow, if you'd like to allow jobs to use a greater amount of
memory+swap, for now you could increase AllowedSwapSpace to something
like 50 (which would mean that memory+swap would be 50% larger than
memory alone).
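(For the 64530 MB allocation discussed above, for example, that would
put the memory+swap limit at roughly 64530 * 1.5 = 96795 MB, while the
RAM limit itself stays at 64530 MB.)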
Post by Bjørn-Helge Mevik
Post by Mark A. Grondona
I have a spank plugin that essentially greps the dmesg output
after job completion and issues such a message to the stderr of
the job if a task has been terminated by the OOM killer. It is
not perfect, but works 90% of the time. I can send it to you if
you like.
Yes, I'd very much like that! Jobs killed by the memory limit are quite
common on our cluster, and users get confused if there is no message
telling them why the job died.
Ok, I have added an "example" lua plugin to the slurm-spank-plugins
project on google code called oom-detect.lua. You can browse the
code here:

http://code.google.com/p/slurm-spank-plugins/source/browse/lua/oom-detect.lua

The plugin requires the lua-posix module, available here:

https://github.com/luaposix/luaposix

or potentially it may be distributed with your distro. If you have
trouble getting or installing the luaposix module, we could remove
the posix dependency with only a small loss of functionality.

As mentioned in the comments at the top, you may have to update the
pattern match for your kernel. If you send me the line of dmesg output
for an OOM killed task, I may be able to help with that.

mark
Bjørn-Helge Mevik
2012-09-06 13:37:04 UTC
Post by Mark A. Grondona
Hm, that doesn't seem to agree with the documentation, so I would
assume that is a bug.
Yes, especially since ConstrainSwapSpace seems to constrain RAM+swap
according to the documentation.
Post by Mark A. Grondona
Either the docs or the code should be updated.
Anyhow, if you'd like to allow jobs to use a greater amount of
memory+swap, for now you could increase AllowedSwapSpace to something
like 50 (which would mean that memory+swap would be 50% larger than
memory alone)
Thanks for the tip.
Post by Mark A. Grondona
Ok, I have added an "example" lua plugin to the slurm-spank-plugins
project on google code called oom-detect.lua. You can browse the
Thanks!
--
Cheers,
Bjørn-Helge Mevik, dr. scient,
Research Computing Services, University of Oslo
Matthieu Hautreux
2012-09-11 23:08:05 UTC
Post by Bjørn-Helge Mevik
Post by Mark A. Grondona
Hm, that doesn't seem to agree with the documentation, so I would
assume that is a bug.
Yes, especially since ConstrainSwapSpace seems to constrain RAM+swap
according to the documentation.
I agree that this is a bug. The ConstrainSwapSpace option is not checked
before setting the RAM+swap limit, so that limit is always applied. This
explains why a workaround is to set AllowedSwapSpace to 50 or some other
sufficiently large value.

The same applies to ConstrainRAMSpace, which is always enforced even if
only ConstrainSwapSpace is enabled in the configuration. The modification
should be pretty simple (add two checks in
task_cgroup_memory.c:memcg_initialize). I will try to propose
something tomorrow.
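
To illustrate the idea (this is not the actual patch, and the helpers
below are hypothetical stand-ins rather than Slurm's real internal API):

-------------
/* Only write each limit file when the corresponding Constrain* option
 * from cgroup.conf is enabled. */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

static void write_memcg_param(const char *cgroup, const char *param,
                              uint64_t bytes)
{
    char path[512];
    snprintf(path, sizeof(path), "%s/%s", cgroup, param);
    FILE *f = fopen(path, "w");
    if (!f) {
        perror(path);
        return;
    }
    fprintf(f, "%llu", (unsigned long long) bytes);
    fclose(f);
}

/* hypothetical stand-in for memcg_initialize() */
static void memcg_initialize_sketch(bool constrain_ram, bool constrain_swap,
                                    const char *step_cgroup,
                                    uint64_t mem_limit, uint64_t memsw_limit)
{
    if (constrain_ram)
        write_memcg_param(step_cgroup, "memory.limit_in_bytes", mem_limit);

    /* Note: the ConstrainRAMSpace=no + ConstrainSwapSpace=yes case needs
     * extra care, since the kernel only accepts a memory.memsw limit that
     * is >= the current memory.limit_in_bytes. */
    if (constrain_swap)
        write_memcg_param(step_cgroup, "memory.memsw.limit_in_bytes",
                          memsw_limit);
}

int main(void)
{
    /* ConstrainRAMSpace=yes, ConstrainSwapSpace=no, as in the cgroup.conf
     * at the top of this thread */
    memcg_initialize_sketch(true, false,
        "/dev/cgroup/memory/slurm/uid_10231/job_344/step_4294967294",
        64530ULL << 20, 64530ULL << 20);
    return 0;
}
-------------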

Regards,
Matthieu
Matthieu Hautreux
2012-09-12 20:58:07 UTC
You will find enclosed a proposed patch to solve the issue.

Let me know whether the logic for handling the ConstrainRAMSpace=no
and ConstrainSwapSpace=yes combination sounds good to you.

We should plan to study the eventfd logic and provide an option in
cgroup.conf to explicitly request preempting the oom-killer for job
steps when possible. It would definitely be a good thing to have a
deterministic way to detect the overrun.

Regards,
Matthieu
Moe Jette
2012-09-13 19:12:07 UTC
This change will be in version 2.4.3, which we hope to release within
a few days.

Thank you!