Discussion:
Intel MPI Performance inconsistency (and workaround)
Jesse Stroik
2014-08-21 18:42:35 UTC
Permalink
Slurmites,

We recently noticed sporadic performance inconsistencies on one of our
clusters. We discovered that if we restarted slurmd in an interactive
shell, we observed correct performance.

To track down the cause, we ran:

(1) single-node linpack
(2) dual-node mp_linpack
(3) mpptest

On affected nodes, single-node Linpack performance was normal, but
mp_linpack achieved only about 85% of the expected performance.

mpptest, which measures MPI performance, was our smoking gun. Latencies
were 10x higher than expected (~20us instead of < 2us). We were able to
consistently reproduce the issue on freshly imaged or freshly rebooted
nodes. Upon restarting slurmd manually on each execution node, MPI
latencies immediately dropped to the expected < 2us across our set of
tested nodes.
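
In case it helps anyone, this is roughly how we applied the workaround;
pdsh, the node list, and the init script path below are placeholders,
so adjust to however slurmd is managed on your nodes.

-----------
# restart slurmd on the affected nodes; pdsh, node[001-032], and the
# init script path are illustrative
pdsh -w node[001-032] '/etc/init.d/slurm restart'
-----------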

The cluster is under fairly heavy use right now, so we don't have the
luxury of diagnosing this thoroughly and determining the root cause. We
wanted to share this experience in case it helps other users; we'd also
be happy to file a bug report and gather further information if any
Slurm developers are interested.

Best,
Jesse Stroik

University of Wisconsin
Kilian Cavalotti
2014-08-21 19:14:35 UTC
Permalink
Hi Jesse,

Just a shot in the dark, but do you use task affinity or CPU binding?

Cheers,
--
Kilian
Jesse Stroik
2014-08-21 19:22:41 UTC
Permalink
Yes, but we aren't specifying it for all of these jobs. In the config we
have:


-----------
TaskPlugin=task/affinity
TaskPluginParam=Sched
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
-----------

We typically suggest "--cpu_bind=core --distribution=block:block" for
srun in our documentation. However, we did not pass --cpu_bind or
--distribution to the mpptest or mp_linpack jobs. We have also noticed
that, despite the CR_CORE_DEFAULT_DIST_BLOCK setting, we still need to
specify --distribution=block:block to get correct binding for hybrid
OpenMP+MPI jobs.
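
For reference, a typical invocation from our docs for a hybrid job
looks roughly like this (the task/thread counts and the binary name
are placeholders):

-----------
# hybrid MPI+OpenMP launch as our documentation suggests; -N,
# --ntasks-per-node, --cpus-per-task and ./hybrid_app are placeholders
export OMP_NUM_THREADS=8
srun -N 2 --ntasks-per-node=2 --cpus-per-task=8 \
     --cpu_bind=core --distribution=block:block ./hybrid_app
-----------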

Best,
Jesse
Post by Kilian Cavalotti
Hi Jesse,
Just a shot in the dark, but do you use task affinity or CPU binding?
Cheers,
Christopher Samuel
2014-08-21 23:15:33 UTC
Permalink
Post by Jesse Stroik
We recently noticed sporadic performance inconsistencies on one of our
clusters.
What distro is this? Are you using cgroups?

cheers,
Chris
--
Christopher Samuel Senior Systems Administrator
VLSCI - Victorian Life Sciences Computation Initiative
Email: samuel-***@public.gmane.org Phone: +61 (0)3 903 55545
http://www.vlsci.org.au/ http://twitter.com/vlsci
Jesse Stroik
2014-08-22 15:18:34 UTC
Permalink
CentOS 6. We're not running cgroups. For these particular jobs we're
letting MPI choose the bindings rather than specifying the CPU binding
ourselves.

For reference, the MPI latency gap we're looking at is 10x, and it's
over a very small number of CPU cores (about 700).
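
If it's useful for comparison, Intel MPI will print the pinning it
chose when its debug level is raised; the launch line below is just
illustrative.

-----------
# I_MPI_DEBUG=4 or higher makes Intel MPI report the rank pinning it
# applied at startup; the rank count and binary are placeholders
export I_MPI_DEBUG=4
mpirun -np 2 ./mpptest
-----------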

Best,
Jesse
Post by Christopher Samuel
Post by Jesse Stroik
We recently noticed sporadic performance inconsistencies on one of our
clusters.
What distro is this? Are you using cgroups?
cheers,
Chris
Andy Riebs
2014-08-25 13:24:32 UTC
Permalink
Assuming this is a GNU/Linux system, be sure that /etc/sysconfig/slurm
exists on all nodes and contains the line

ulimit -l unlimited

That can account for the difference in behavior between slurmd started
at system boot and slurmd restarted by hand from an interactive shell.
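
Concretely, something like this; it assumes the stock SysV init script,
which sources /etc/sysconfig/slurm before launching slurmd, and the
grep is just one way to check what the running daemon actually got.

-----------
# /etc/sysconfig/slurm -- sourced by the slurmd init script at startup
ulimit -l unlimited

# check the limit the running slurmd ended up with
grep "Max locked memory" /proc/$(pgrep -o slurmd)/limits
-----------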

Andy
Jesse Stroik
2014-08-25 20:38:37 UTC
Permalink
I tested this today on a set of freshly imaged nodes, and it appears to
have fixed the problem.

Thanks, Andy.

Best,
Jesse
Post by Andy Riebs
Assuming this is a Gnu/Linux system, be sure that you have
/etc/sysconfig/slurm on all nodes with the line
ulimit -l unlimited
That can account for differences in processing between system startup
and subsequently restarting the daemons by hand.
Andy
Post by Jesse Stroik
Slurmites,
We recently noticed sporadic performance inconsistencies on one of our
clusters. We discovered that if we restarted slurmd in an interactive
shell, we observed correct performance.
(1) single-node linpack
(2) dual node mp_linpack
(3) mpptest
On affected nodes, Linpack performance was normal and mp_linpack was
about 85% as high as expected.
mpptest, which measures MPI performance, was our smoking gun.
Latencies to be 10x higher than expected (~20us instead of < 2us). We
were able to consistently reproduce the issue with freshly imaged or
freshly rebooted nodes. Upon restarting slurmd on each execution node
manually, MPI latencies immediately improved to the expected < 2us for
our set of tested nodes.
The cluster is under fairly heavy use right now so we don't have the
luxury of diagnosing this thoroughly and determining the cause. We
wanted to share this experience with others in case it can help other
users or if any slurm developers would like us to file a bug report
and be interested in gathering further information.
Best,
Jesse Stroik
University of Wisconsin
Holmes, Christopher (CMU)
2014-08-28 13:54:32 UTC
Permalink
Andy is right. When you restart the slurmd daemon by hand, it inherits the resource limits of your login session, which are different from the default limits the daemon gets when it is started at boot.

If you modified /etc/security/limits.conf, or made changes in any of the bash startup scripts to improve the user environment, you should ensure that those same changes are added to /etc/sysconfig/slurm so that they can be applied to the slurmd daemons on boot-up.
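
For example, if your interactive sessions pick up their memlock limit from /etc/security/limits.conf, slurmd needs the equivalent at boot; the entries below are illustrative, not a drop-in config.

-----------
# /etc/security/limits.conf -- applies to PAM login sessions
*    soft    memlock    unlimited
*    hard    memlock    unlimited

# /etc/sysconfig/slurm -- applied when slurmd starts at boot
ulimit -l unlimited
-----------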

Regards,
--Chris


Jesse Stroik
2014-08-28 15:17:36 UTC
Permalink
Yes, I did test this and can confirm it worked. Thanks.

Best,
Jesse Stroik
University of Wisconsin