Jesse Stroik
2014-08-21 18:42:35 UTC
Slurmites,
We recently noticed sporadic performance inconsistencies on one of our
clusters. We discovered that restarting slurmd from an interactive
shell restored the expected performance.
To track down the cause, we ran:
(1) single-node linpack
(2) dual-node mp_linpack
(3) mpptest
On affected nodes, Linpack performance was normal, while mp_linpack
reached only about 85% of the expected performance.
mpptest, which measures MPI performance, was our smoking gun. Latencies
were about 10x higher than expected (~20us instead of < 2us). We were able to
consistently reproduce the issue with freshly imaged or freshly rebooted
nodes. Upon restarting slurmd on each execution node manually, MPI
latencies immediately improved to the expected < 2us for our set of
tested nodes.
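For anyone who wants to screen nodes the same way, here is a minimal Python sketch of the comparison we made by eye: take measured small-message latencies per node pair and flag the pairs at or above the expected ceiling. The function name, node names, and sample numbers are hypothetical and only for illustration; the 2us threshold is the expected latency mentioned above. It does not run mpptest itself, so the measurements would have to be collected separately.

```python
# Hypothetical helper: flag node pairs whose measured MPI latency is at or
# above the expected ceiling. Names and sample values are made up.

EXPECTED_MAX_US = 2.0  # expected small-message latency ceiling (< 2us)


def flag_slow_pairs(latencies_us, threshold_us=EXPECTED_MAX_US):
    """Return (pair, latency, ratio) for each slow pair, slowest first.

    latencies_us maps a (node_a, node_b) tuple to a latency in microseconds.
    ratio is the latency divided by the threshold.
    """
    slow = [
        (pair, latency, latency / threshold_us)
        for pair, latency in latencies_us.items()
        if latency >= threshold_us
    ]
    return sorted(slow, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Made-up measurements: one healthy pair and two affected pairs.
    sample = {
        ("node01", "node02"): 1.8,
        ("node03", "node04"): 20.0,
        ("node05", "node06"): 19.5,
    }
    for pair, latency, ratio in flag_slow_pairs(sample):
        print(f"{pair[0]} <-> {pair[1]}: {latency:.1f}us ({ratio:.1f}x threshold)")
```

Running the same check before and after restarting slurmd on a node would show whether the restart brought that node back under the threshold.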
The cluster is under fairly heavy use right now, so we don't have the
luxury of diagnosing this thoroughly and determining the cause. We
wanted to share this experience in case it helps other users, or in
case any Slurm developers would like us to file a bug report and
gather further information.
Best,
Jesse Stroik
University of Wisconsin