Discussion: Preemption and job cancellation
Satrajit Ghosh
2014-08-05 21:46:53 UTC
hi

our cluster is set up with the configuration below, yet we have been having
a lot of jobs cancelled when preempted:

slurmd[node004]: *** JOB 79188 CANCELLED AT 2014-08-05T15:31:41 DUE TO
PREEMPTION ***
i thought the settings would simply suspend the job instead of canceling it.

cheers,

satra

Partial configuration
---------------------------

PreemptMode=GANG,SUSPEND

PreemptType=preempt/partition_prio

# default

SchedulerTimeSlice=30

DefMemPerCPU=2048

DefMemPerNode=2048

PartitionName=DEFAULT MaxTime=7-0 DefaultTime=24:00:00

# Partitions

PartitionName=defq Default=NO MinNodes=1 DefaultTime=1-00:00:00
MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO
RootOnly=NO Hidden=YES Shared=NO GraceTime=0 ReqResv=NO
PreemptMode=GANG,SUSPEND State=UP

PartitionName=om_all_nodes Default=YES MinNodes=1 DefaultTime=1-00:00:00
MaxTime=7-00:00:00 AllowGroups=ALL Priority=1 DisableRootJobs=NO
RootOnly=NO Hidden=NO Shared=FORCE:4 GraceTime=0 ReqResv=NO
PreemptMode=GANG,SUSPEND State=UP Nodes=node[001-030]

PartitionName=om_interactive Default=NO MinNodes=1 MaxNodes=1
DefaultTime=01:00:00 MaxTime=01:00:00 AllowGroups=ALL Priority=10
DisableRootJobs=NO RootOnly=NO Hidden=NO Shared=FORCE:1 GraceTime=0
MaxCPUsPerNode=32 ReqResv=NO PreemptMode=GANG,SUSPEND State=UP Nodes=node017
Marcin Stolarek
2014-08-05 22:35:33 UTC
Post by Satrajit Ghosh
hi
our cluster is set up with the configuration below, yet we have been having
a lot of jobs cancelled when preempted:
slurmd[node004]: *** JOB 79188 CANCELLED AT 2014-08-05T15:31:41 DUE TO
PREEMPTION ***
i thought the settings would simply suspend the job instead of canceling it.
[configuration snipped]
If I remember the logic correctly, Slurm will try to suspend your job, but if
the plugin (proctrack?) fails to suspend it, the job will be killed.

Are you using the cgroup freezer or SIGSTOP to suspend your jobs?
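One quick way to check which process-tracking plugin is in play (just a
sketch; the output formatting varies a bit across Slurm versions):

scontrol show config | grep ProctrackType

As far as I know, proctrack/cgroup suspends jobs through the freezer cgroup,
while the other proctrack plugins fall back to plain SIGSTOP/SIGCONT signals.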

hope this can help

marcin
Satrajit Ghosh
2014-08-15 12:39:33 UTC
hi marcin,
Post by Marcin Stolarek
Post by Satrajit Ghosh
slurmd[node004]: *** JOB 79188 CANCELLED AT 2014-08-05T15:31:41 DUE TO
PREEMPTION ***
If I remember the logic correctly, Slurm will try to suspend your job, but if
the plugin (proctrack?) fails to suspend it, the job will be killed.
Are you using the cgroup freezer or SIGSTOP to suspend your jobs?
we are simply using SIGSTOP. is there a way to check why the plugin fails?
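one thing i can think of trying, assuming the slurmd log lives wherever
SlurmdLogFile points in slurm.conf: raise the slurmd verbosity and watch the
node's log around a preemption event, something like

# in slurm.conf (level names vary by version; older releases use numbers)
SlurmdDebug=debug2

# push the change to the daemons without a restart
scontrol reconfigure

# then on the affected node; the path below is just a common default
grep -i -e suspend -e preempt /var/log/slurm/slurmd.log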

cheers,

satra
