Discussion:
Enforcing the use of srun and application logging
Jordi Blasco
2014-06-10 11:19:31 UTC
Permalink
Hi,

we are using the Snoopy library (https://github.com/a2o/snoopy) to
monitor and collect statistics about the applications used on our
HPC resources.
Since more than 30% of the jobs in our database have no information
of this kind, it seems that Snoopy is not able to track everything.

Other tools such as PerfMiner or monitor (
http://web.eecs.utk.edu/~mucci/monitor/) are used at several sites, but
since monitor relies on PapiEx (http://icl.cs.utk.edu/~mucci/papiex/), and that
project is no longer supported, I would like to know whether there is another
approach to collecting this data.

In addition to that, I would like to know whether it is possible to enforce
the use of srun in the submit script. I used an sbatch wrapper before, but
maybe there is now a better way to do it.

Thanks!

Regards,

Jordi
j***@public.gmane.org
2014-06-10 19:31:35 UTC
Permalink
Post by Jordi Blasco
In addition to that, I would like to know whether it is possible to enforce
the use of srun in the submit script. I used an sbatch wrapper before, but
maybe there is now a better way to do it.
A job submit plugin may be your best option for that. See:
http://slurm.schedmd.com/job_submit_plugins.html
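
To make that concrete, here is a minimal, untested sketch of what such a
plugin could look like. The "require_srun" name, the simple strstr() check,
and the exact headers and struct fields are my assumptions based on the C
interface documented on that page, so adjust for your Slurm version:

/*
 * Illustrative job_submit plugin that rejects batch scripts which never
 * mention "srun".  Sketch only: it has to be built inside the Slurm
 * source tree, and struct/field names can change between releases.
 */
#include <string.h>

#include "slurm/slurm_errno.h"
#include "src/common/xstring.h"
#include "src/slurmctld/slurmctld.h"

/* Symbols every Slurm plugin must export. */
const char plugin_name[] = "Require srun job submit plugin (example)";
const char plugin_type[] = "job_submit/require_srun";
const uint32_t plugin_version = 100; /* newer releases expect SLURM_VERSION_NUMBER */

int init(void) { return SLURM_SUCCESS; }
void fini(void) { }

extern int job_submit(struct job_descriptor *job_desc, uint32_t submit_uid,
                      char **err_msg)
{
        /* job_desc->script holds the batch script text for sbatch jobs;
         * interactive salloc/srun requests have no script to inspect. */
        if (job_desc->script && !strstr(job_desc->script, "srun")) {
                *err_msg = xstrdup("batch script must launch tasks with srun");
                return ESLURM_ACCESS_DENIED;
        }
        return SLURM_SUCCESS;
}

extern int job_modify(struct job_descriptor *job_desc,
                      struct job_record *job_ptr, uint32_t submit_uid)
{
        return SLURM_SUCCESS;
}

The resulting shared object then gets listed in JobSubmitPlugins in
slurm.conf. Note that this is only a text match on the top-level script,
so wrappers and nested scripts will defeat it.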
Bill Barth
2014-06-10 21:47:35 UTC
Permalink
Jordi,

It's basically impossible to force people to call srun somewhere in their
batch script. If you only want to allow the very simplest of batch
scripts, then you can grep them at job submit time with a job submit
plugin, but if their script calls a script that calls a script (and so on)
that eventually calls srun, you'll never detect that they've done what you
wanted. Worse, you'll constantly flag scripts as non-compliant even though
the users have done what you wanted, just some levels down.

We have a wrapper around the MPI job starters that we support (MVAPICH2
and Intel MPI) that calls the right startup mechanisms with the right
arguments. But we haven't tried to force our users to use this script. The
vast majority of them do what we want because a) we train them on it and
document it well, and b) our method is generally easier to use than the
other options.

For monitoring, you might check out the project that I work on called TACC
Stats, which provides accounting and performance monitoring for HPC jobs.
Some parts of the project are in a state of flux as we are adding new
features, but things should begin to stabilize this summer. TACC Stats
will also work with a sister project called XALT, which will have its
first release this summer and will provide information about the
executables and libraries used by HPC jobs. More information and source
code for TACC Stats can be found on GitHub, and XALT should be available
on GitHub later this summer.

git clone git@github.com:rtevans/tacc_stats.git (this will eventually move
to the main TACC GitHub, but that's a work in progress)


Best,
Bill.



--
Bill Barth, Ph.D., Director, HPC
bbarth-***@public.gmane.org | Phone: (512) 232-7069
Office: ROC 1.435 | Fax: (512) 475-9445