Discussion:
slurm-14.03.3-2/git: make check fails, ../../../auxdir/test-driver missing
Tru Huynh
2014-06-17 10:03:26 UTC
hello,

Am I missing something obvious? I am using the usual
./configure --prefix .. && make && make check && make install.

I have tried the 14.03.3-2 tarball and the git version;
both fail at the "make check" step.

make[2]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit'
Making check in api
make[3]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
Making check in manual
make[4]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api/manual'
make cancel-tst complete-tst job_info-tst node_info-tst partition_info-tst reconfigure-tst submit-tst update_config-tst
make[5]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api/manual'
make[5]: `cancel-tst' is up to date.
make[5]: `complete-tst' is up to date.
make[5]: `job_info-tst' is up to date.
make[5]: `node_info-tst' is up to date.
make[5]: `partition_info-tst' is up to date.
make[5]: `reconfigure-tst' is up to date.
make[5]: `submit-tst' is up to date.
make[5]: `update_config-tst' is up to date.
make[5]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api/manual'
make[4]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api/manual'
make[4]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make api-test
make[5]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[5]: `api-test' is up to date.
make[5]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make check-TESTS
make[5]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[6]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
/bin/sh: ../../../auxdir/test-driver: No such file or directory
make[6]: *** [api-test.log] Error 127
make[6]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[5]: *** [check-TESTS] Error 2
make[5]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[4]: *** [check-am] Error 2
make[4]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[3]: *** [check-recursive] Error 1
make[3]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/dev/shm/slurm/testsuite/slurm_unit'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/dev/shm/slurm/testsuite'
make: *** [check-recursive] Error 1

configure line for the git version:
./configure --prefix=/dev/shm/slurm-git-hdf5-1.8-hwloc-1.8-munge-0.5.11 --with-hdf5=/baycells/home/slurm/hdf5/1.8.11/bin/h5cc --with-hwloc=/baycells/home/slurm/hwloc/1.8 --with-munge=/opt/munge/0.5.11 --enable-pam --enable-front-end

Thanks

Tru
--
Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:tru-M1hYzG+***@public.gmane.org | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France
Jeff Tan
2014-06-18 04:12:32 UTC
Hi folks

I've also recently encountered a similar problem, but in my case, I'm
wondering if this is normal or not.

We have a user submitting MPI jobs with

#SBATCH --nodes=8
#SBATCH --exclusive
#SBATCH --mem-per-cpu=131072

and calling mpiexec (not srun) with 128 tasks requested.

These jobs land on nodes with 250000 MB of real memory and 16 cores, e.g., with
sinfo -o %N,%c,%m,%Z

NODELIST,CPUS,MEMORY,THREADS
barcoo004,16,250000,1

# scontrol show job 1678855
JobId=1678855 Name=ls_job
..
JobState=RUNNING Reason=None Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 ExitCode=0:0
RunTime=1-21:53:28 TimeLimit=2-12:00:00 TimeMin=N/A
..
NodeList=barcoo[004,051,062,066-070]
BatchHost=barcoo004
NumNodes=8 NumCPUs=8 CPUs/Task=1 ReqS:C:T=*:*:*
MinCPUsNode=1 MinMemoryCPU=128G MinTmpDiskNode=0

Our guess was that with mem-per-cpu=131072MB, only one task can fit on each
node, so Slurm accounting only includes one CPU per node:

# sacct -j 1678855 -o jobid,nnodes,ntasks,alloccpus
       JobID   NNodes   NTasks  AllocCPUS
------------ -------- -------- ----------
1678855             8                   8
1678855.0           7        7          7
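
The guess above can be checked with quick arithmetic. This is only a sketch of our reasoning, using the numbers from the job script and the sinfo output; it is not a description of how Slurm actually computes the allocation, and the variable names are made up for illustration.

```shell
#!/bin/sh
# Values copied from the job script and the sinfo output in this post.
real_mem_mb=250000      # MEMORY column reported by sinfo for barcoo004
mem_per_cpu_mb=131072   # --mem-per-cpu from the job script
nodes=8                 # --nodes from the job script

# Integer division: how many CPUs' worth of memory fit on one node.
cpus_per_node=$(( real_mem_mb / mem_per_cpu_mb ))
# Assumption: every node in the job has the same amount of memory.
total_cpus=$(( cpus_per_node * nodes ))

echo "CPUs per node by memory: $cpus_per_node"
echo "CPUs across all nodes:   $total_cpus"
```

With these values the division gives 1 CPU per node and 8 CPUs in total, which would match NumCPUs=8 in the scontrol output.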

However, all 16 CPUs in the node are allocated (because of --exclusive):

# scontrol show node barcoo004
NodeName=barcoo004 Arch=x86_64 CoresPerSocket=8
CPUAlloc=16 CPUErr=0 CPUTot=16 CPULoad=16.00 Features=(null)
Gres=(null)
NodeAddr=barcoo004 NodeHostName=barcoo004
OS=Linux RealMemory=250000 AllocMem=131072 Sockets=2 Boards=1
State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=2
BootTime=2014-02-25T23:28:52 SlurmdStartTime=2014-05-13T09:45:33
CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s

Note CPUAlloc=16 but AllocMem=131072 -- the latter implying only one CPU is
allocated. And when we look inside, there are 16 MPI tasks running
(engaging all 16 cores).
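
To put numbers on the mismatch, here is a rough sketch comparing what AllocMem implies with what mpiexec is asked to start. It assumes mpiexec spreads the 128 tasks evenly over the 8 nodes, which is consistent with the 16 tasks we see on barcoo004 but is not something we verified on every node; the variable names are illustrative only.

```shell
#!/bin/sh
# Values copied from the scontrol output and the job description in this post.
alloc_mem_mb=131072     # AllocMem from "scontrol show node barcoo004"
mem_per_cpu_mb=131072   # --mem-per-cpu from the job script
mpi_tasks=128           # tasks requested through mpiexec
nodes=8                 # --nodes from the job script
cpu_tot=16              # CPUTot from "scontrol show node barcoo004"

# CPUs implied by the memory Slurm has accounted for on the node.
implied_cpus=$(( alloc_mem_mb / mem_per_cpu_mb ))
# Assumption: mpiexec distributes tasks evenly across the nodes.
tasks_per_node=$(( mpi_tasks / nodes ))

echo "CPUs implied by AllocMem: $implied_cpus"
echo "MPI tasks per node:       $tasks_per_node (node has $cpu_tot CPUs)"
```

So the memory accounting corresponds to 1 CPU per node, while the tasks actually running fill all 16.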

Is the disparity because mpiexec is called instead of srun? (srun is avoided
here because of performance degradation with it on 2.6.5.)

Regards

Jeff Tan

High Performance Computing Specialist
IBM Research Collaboratory for Life Sciences, Melbourne, Australia
Tru Huynh
2014-07-07 19:32:35 UTC
Hi,
Post by Tru Huynh
hello,
Am I missing something obvious? I am using the usual
./configure --prefix .. && make && make check && make install.
I have tried with the 14.03.3-2 tarball and the git version,
both are failing the "make check" step.
...
Post by Tru Huynh
make check-TESTS
make[5]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
make[6]: Entering directory `/dev/shm/slurm/testsuite/slurm_unit/api'
/bin/sh: ../../../auxdir/test-driver: No such file or directory
make[6]: *** [api-test.log] Error 127
...
Post by Tru Huynh
make: *** [check-recursive] Error 1
./configure --prefix=/dev/shm/slurm-git-hdf5-1.8-hwloc-1.8-munge-0.5.11 --with-hdf5=/baycells/home/slurm/hdf5/1.8.11/bin/h5cc --with-hwloc=/baycells/home/slurm/hwloc/1.8 --with-munge=/opt/munge/0.5.11 --enable-pam --enable-front-end

Just following up: I hit the same issue with slurm-14.03.4-2.tar.bz2, until
some co-workers hinted that I should run ./autogen.sh before
./configure ... That fixed the missing "test-driver" file issue.
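
For anyone hitting the same error, a minimal sketch of the check I would do now before building. The helper name have_test_driver is made up for this example, and the auxdir/test-driver path is simply the one from the error message, resolved relative to the top of the source tree; adjust if your tree differs.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the automake test driver exists in the
# given source tree (defaults to the current directory).
have_test_driver() {
    srcdir="${1:-.}"
    [ -f "$srcdir/auxdir/test-driver" ]
}

# Report what to do for the tree in the current directory.
if have_test_driver .; then
    echo "auxdir/test-driver present"
else
    echo "auxdir/test-driver missing: run ./autogen.sh before ./configure"
fi

# Intended sequence from the top of the source tree (not executed here):
#   ./autogen.sh
#   ./configure --prefix .. && make && make check && make install
```

The script only checks and prints a hint; it does not run autogen.sh itself, so it is safe to try in any directory.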

The last "make check" failure was:
...
bitstring-test.c: In function 'main':
bitstring-test.c:23: error: expected ')' before numeric constant
bitstring-test.c:26: error: 'bs' undeclared (first use in this function)
bitstring-test.c:26: error: (Each undeclared identifier is reported only once
bitstring-test.c:26: error: for each function it appears in.)
make[4]: *** [bitstring-test.o] Error 1
make[4]: Leaving directory `/baycells/home/tru/build/slurm-14.03.3-2/testsuite/slurm_unit/common'
make[3]: *** [check-am] Error 2
make[3]: Leaving directory `/baycells/home/tru/build/slurm-14.03.3-2/testsuite/slurm_unit/common'
make[2]: *** [check-recursive] Error 1
make[2]: Leaving directory `/baycells/home/tru/build/slurm-14.03.3-2/testsuite/slurm_unit'
make[1]: *** [check-recursive] Error 1
make[1]: Leaving directory `/baycells/home/tru/build/slurm-14.03.3-2/testsuite'
make: *** [check-recursive] Error 1

which is fixed in the git tree by commit e88b98995aea36aa32ea7d6b090f29fd504bd29f

ymmv,

Tru
--
Dr Tru Huynh | http://www.pasteur.fr/recherche/unites/Binfs/
mailto:tru-M1hYzG+***@public.gmane.org | tel/fax +33 1 45 68 87 37/19
Institut Pasteur, 25-28 rue du Docteur Roux, 75724 Paris CEDEX 15 France