Discussion:
DRMAA job submission returns state FAILED
Julien Rey
2014-10-20 12:29:56 UTC
Hello everyone,


I am currently having trouble making python-drmaa work with Slurm: jobs
systematically return a FAILED state (exit code 256) when I launch them
with python-drmaa as a non-root user. I have no problem if I run as root.
Here's the code sample I've been using for testing:

#!/usr/bin/env python
import os

# Point the DRMAA bindings at the Slurm DRMAA library before importing drmaa
os.environ['DRMAA_LIBRARY_PATH'] = '/usr/lib/slurm-drmaa/lib/libdrmaa.so.1.0.6'
import drmaa

def main():
    s = drmaa.Session()
    s.initialize()

    print 'Creating job template'
    jt = s.createJobTemplate()
    jt.nativeSpecification = ''
    jt.remoteCommand = 'sleep'
    jt.args = ['30']  # args must be a list of strings, not a bare string

    jobid = s.runJob(jt)
    print 'Your job has been submitted with id ' + jobid

    jinfo = s.wait(jobid, drmaa.Session.TIMEOUT_WAIT_FOREVER)
    print 'Job exited with ', jinfo.exitStatus

    print 'Cleaning up'
    s.deleteJobTemplate(jt)
    s.exit()

if __name__ == '__main__':
    main()

Here are the results of the sacct command after I ran the script first as a
regular user and then as root:

519        allocation  debug  mti   1  FAILED     1:0
519.batch  batch              mti   1  FAILED     1:0
520        allocation  debug  root  1  COMPLETED  0:0
520.batch  batch              root  1  COMPLETED  0:0

And here are the logs from /var/log/slurm-llnl/slurmctld.log

Run as user:

[2014-10-17T13:27:56.819] _slurm_rpc_submit_batch_job JobId=519 usec=455
[2014-10-17T13:27:56.823] sched: Allocate JobId=519 NodeList=node100 #CPUs=1
[2014-10-17T13:27:56.859] completing job 519
[2014-10-17T13:27:56.861] sched: job_complete for JobId=519 successful, exit code=256
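(As an aside, my own note rather than anything from the post: the exit
code=256 logged above looks like a raw wait(2) status word, in which a
normal exit code sits in the high byte. A quick check with Python's os
module decodes 256 to a plain exit status of 1:)

```python
import os

raw = 256  # the value slurmctld logged for job 519

# wait(2) packs a normal exit code into bits 8-15 of the status word,
# so 256 == (1 << 8), i.e. the job's process exited with status 1.
if os.WIFEXITED(raw):
    print('exit status %d' % os.WEXITSTATUS(raw))  # exit status 1
```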

Run as root:

[2014-10-17T13:28:39.879] _slurm_rpc_submit_batch_job JobId=520 usec=468
[2014-10-17T13:28:39.882] sched: Allocate JobId=520 NodeList=node100 #CPUs=1
[2014-10-17T13:28:42.963] completing job 520
[2014-10-17T13:28:42.965] sched: job_complete for JobId=520 successful, exit code=0

Also, I have no problem running jobs with the srun command as a non-root
user. For instance, if I run as www-data:

srun sleep 30

and then

sacct -a

I get:

522 sleep debug mobyle 1 COMPLETED 0:0

Here are the packages that were installed:

- slurm-llnl 2.6.7-2+b1
- slurm-drmaa1 1.0.7-1
- python-drmaa 0.5-1

I am completely new to Slurm and DRMAA, so I have no idea where to look.

Any help will be greatly appreciated.
j***@public.gmane.org
2014-10-20 16:02:41 UTC
I would recommend looking at the slurmd log on node100.
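(For instance, assuming the Debian default log location, mirroring the
slurmctld path given above, on node100:)

```shell
# On node100: show recent slurmd activity around the failed submission
tail -n 100 /var/log/slurm-llnl/slurmd.log

# Or pull out only the lines mentioning the failed job
grep -i 'job 519' /var/log/slurm-llnl/slurmd.log
```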
--
Morris "Moe" Jette
CTO, SchedMD LLC