Discussion:
overcounting of SysV shared memory segments?
Riccardo Murri
2014-09-19 16:01:32 UTC
Hello,

we are having an issue with SLURM killing jobs because of virtual
memory limits::

slurmstepd[46530]: error: Job 784 exceeded virtual memory limit
(416329820 > 211812352), being killed
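
For reference, assuming both figures are in kB, the limit would
correspond to a 200 GiB memory request scaled by VSizeFactor=101,
while the reported usage works out to roughly four times the size of
the shared segment described below::

  211812352 kB = 209715200 kB (200 GiB) x 1.01
  416329820 kB ~ 397 GiB ~ 4 x ~99 GiB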

The problem is that the job above actually has negligible heap use,
*but* it allocates a SysV shared memory segment of about 100GB. It
seems that the size of this shared memory segment is charged to each
of the 4 processes in the job, instead of being counted just once.
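
For illustration, a minimal standalone sketch of the pattern (with
hypothetical sizes, not the actual application): one SysV segment is
created once and then attached by all four processes. After shmat()
the whole segment is part of each process's address space, so anything
that sums per-process VSZ over the job sees it four times::

/* Sketch only: one SysV segment attached by four processes.
 * After shmat() the segment size shows up in the VmSize/VSZ of
 * every attached process, not just the one that created it. */
#include <stdio.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/ipc.h>
#include <sys/shm.h>
#include <sys/wait.h>

int main(void)
{
    size_t size = 1UL << 30;                 /* 1 GiB here; ~100 GB in our job */
    int id = shmget(IPC_PRIVATE, size, IPC_CREAT | 0600);
    if (id < 0) { perror("shmget"); return 1; }

    int parent = 1;
    for (int i = 0; i < 3; i++)              /* parent + 3 children = 4 processes */
        if (fork() == 0) { parent = 0; break; }

    void *p = shmat(id, NULL, 0);            /* VmSize of *this* process grows by 'size' */
    if (p == (void *) -1) { perror("shmat"); return 1; }
    sleep(30);                               /* long enough to be sampled by jobacct */
    shmdt(p);

    if (parent) {
        while (wait(NULL) > 0) ;             /* reap children */
        shmctl(id, IPC_RMID, NULL);          /* destroy the segment */
    }
    return 0;
}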

Is this expected, or did we misconfigure something?

We are running 14.03.2. Possibly relevant configuration items::

# slurm.conf
JobAcctGatherType=jobacct_gather/linux
JobCompType=jobcomp/none
MpiDefault=none
ProctrackType=proctrack/pgid
PropagateResourceLimitsExcept=CPU
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
TaskPlugin=task/cgroup
VSizeFactor=101

# cgroup.conf
ConstrainCores=yes

Thanks for any suggestion!

Kind regards,
Riccardo

--
Riccardo Murri
http://www.s3it.uzh.ch/about/team/

S3IT: Services and Support for Science IT
University of Zurich
Winterthurerstrasse 190, CH-8057 Zürich (Switzerland)
Tel: +41 44 635 4222
Fax: +41 44 635 6888
Carlos Fenoy
2014-09-23 13:57:33 UTC
Hi,

We had some users complaining about the same behaviour, but for
resident memory. What we did was modify the accounting plugin to
consider the proportional set size (PSS) instead of RSS. This way the
shared memory is accounted only once, but proportionally for each
process: if 2 processes share a 4MB segment, each is charged 2MB.
Maybe a similar approach could be used in this case.
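
The idea, in a standalone sketch (not our actual plugin patch), is to
sum the "Pss:" lines of /proc/<pid>/smaps instead of using RSS; the
kernel already divides each shared page by the number of processes
mapping it:

/* Sketch: compute a process's PSS by summing the "Pss:" fields of
 * /proc/<pid>/smaps.  Shared pages are split among the processes
 * mapping them, so a 4 MB segment shared by two processes adds
 * 2 MB to each one's PSS. */
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/types.h>

long pss_kb(pid_t pid)                       /* PSS in kB, or -1 on error */
{
    char path[64], line[256];
    long total = 0, val;

    snprintf(path, sizeof(path), "/proc/%d/smaps", (int) pid);
    FILE *fp = fopen(path, "r");
    if (fp == NULL)
        return -1;
    while (fgets(line, sizeof(line), fp))
        if (strncmp(line, "Pss:", 4) == 0 && sscanf(line + 4, "%ld", &val) == 1)
            total += val;
    fclose(fp);
    return total;
}

int main(void)
{
    printf("PSS of this process: %ld kB\n", pss_kb(getpid()));
    return 0;
}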

Regards,
Carles Fenoy