job_submit Lua plugin issue when executing "squeue" commands
Trey Dockendorf
2014-08-01 17:43:35 UTC
I'm attempting to write a Lua job_submit plugin, based on the example in the Slurm source, that assigns jobs to a QOS based on that QOS's currently allocated resources.

So right now we have the following partitions:

PartitionName=serial Nodes=c[0101-0104] Priority=100 AllowQOS=hepx,idhmc,general,aglife MaxNodes=1 MaxTime=120:00:00 PreemptMode=OFF State=UP
PartitionName=mpi_core8 Nodes=c[0925-0926]n[1-2] Priority=100 AllowQOS=mpi MinNodes=2 MaxTime=48:00:00 PreemptMode=OFF State=UP
PartitionName=mpi_core32 Nodes=c[0133-0134],c[0237-0238],c[0934-0936] Priority=100 AllowQOS=mpi MinNodes=2 MaxTime=48:00:00 PreemptMode=OFF State=UP
PartitionName=background Priority=10 AllowQOS=background,grid MaxTime=96:00:00 State=UP

We use partition preemption, which is why the "background" partition exists.

We want users not to have to choose a QOS; instead, a stakeholder's QOS should be chosen based on usage. For example, if the "hepx" QOS is already running on all of its stakeholder CPUs, the submit plugin assigns additional hepx jobs to the "general" QOS so that they run like non-stakeholder jobs.
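As a rough illustration of that rule (a sketch only, not the code from my script), the decision could be written as below. The function name, the cpu_limit parameter, and the numbers in the usage lines are hypothetical; how the limit for each stakeholder QOS is obtained is left open.

```lua
-- Hypothetical helper: choose the QOS for a new job.
--   stakeholder_qos: the submitting group's QOS, e.g. "hepx"
--   used_cpus:       CPUs currently allocated to running jobs in that QOS
--   cpu_limit:       CPUs that QOS is entitled to (assumed to be known)
local function choose_qos(stakeholder_qos, used_cpus, cpu_limit)
  if used_cpus >= cpu_limit then
    -- Stakeholder CPUs are fully in use: run like a non-stakeholder.
    return "general"
  end
  return stakeholder_qos
end

-- Example calls with made-up numbers.
print(choose_qos("hepx", 16, 32))
print(choose_qos("hepx", 32, 32))
```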

To achieve this, I used a command like the following:

local cmd = "squeue --qos=" .. qos .. " --states=R --partition=" .. partition .. " --noheader --format='%C' | paste -sd+ | bc"

The output is captured using io.popen. Unfortunately, any sbatch submission that requires cmd to be executed fails with the following:

# sbatch --uid testuser_hepx -n2 -p serial batches/job_submit_lua_test.slrm
sbatch: error: slurm_receive_msg: Socket timed out on send/recv operation
sbatch: error: Batch job submission failed: Socket timed out on send/recv operation
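For reference, the capture step can be sketched roughly as follows. This is a simplified illustration rather than the exact code in my script: build_cmd and sum_output are hypothetical names, and the summing is done in Lua instead of through "paste -sd+ | bc". It still invokes squeue from inside the plugin, so it only shows the mechanism and does not avoid the timeout.

```lua
-- Hypothetical helper: build the squeue command that prints one CPU count
-- per running job for the given QOS and partition.
local function build_cmd(qos, partition)
  return "squeue --qos=" .. qos .. " --states=R --partition=" .. partition
      .. " --noheader --format='%C'"
end

-- Hypothetical helper: run a shell command and sum the integer found on
-- each line of its output. Returns nil if the command cannot be started.
local function sum_output(cmd)
  local handle = io.popen(cmd)
  if not handle then
    return nil
  end
  local total = 0
  for line in handle:lines() do
    local digits = line:match("%d+")
    if digits then
      total = total + tonumber(digits)
    end
  end
  handle:close()
  return total
end

-- Show the command that would be run for the "hepx" QOS on "serial".
print(build_cmd("hepx", "serial"))
-- printf stands in for squeue here so the summing can be tried off-cluster.
print(sum_output("printf '2\\n4\\n16\\n'"))
```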

I've tried changing the scheduler parameters to "SchedulerParameters=batch_sched_delay=10,defer" with no luck. Based on old mailing list topics, I also tried setting "net.ipv4.tcp_max_syn_backlog" to 8192, and still have the same problem. I notice in the logs that there is a pause of about 10 seconds during the execution of that shell command. When I run it from the command line, there is no delay.

My guess is that it is unwise or impossible to execute "squeue" from the same process that is handling the job submission (job_submit.lua).

Is there another way to achieve this functionality? I've uploaded my current script here: https://gist.github.com/treydock/b964c5599fd057b0aa6a

Thanks,
- Trey

=============================

Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org
