Dennis Zheleznyak
2014-07-13 08:54:38 UTC
Hi All,
Recently we've upgraded our storage system with more network cards so all
the nodes in the cluster can see it. Since then, we tried running the
command:
sbatch -n <no_of_cores> -C [rack1|rack2|rack3] -c <no_of_cores> <script>
However, while that job is queued, other jobs that I submit (to other nodes and
features) with a plain sbatch command, without any constraint, are queued as
well even though there are free resources. When I cancel the job submitted with
the -C option, the other jobs are scheduled and run properly; the problem only
occurs when the -C job is submitted first.
Why is this happening, and how can I resolve it?
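For clarity, the bracketed form above is just my placeholder notation for "any
one of the rack features". The actual invocation, assuming sbatch's OR
constraint syntax and an illustrative core count and script name, looks like
this (the constraint must be quoted so the shell does not treat | as a pipe):

```shell
# Request 8 tasks, 8 cores per task, on nodes tagged with any rack feature.
# Core counts and script name here are only examples.
sbatch -n 8 -c 8 -C "rack1|rack2|rack3" myscript.sh
```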
Part of slurm.conf:
# SCHEDULING
FastSchedule=0
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory
#
#
# JOB PRIORITY
PriorityType=priority/basic
#
# JOB PREEMPTION (optional)
PreemptMode=requeue
PreemptType=preempt/partition_prio
*Node Configuration:*
#rack1
NodeName=hnmp[106-164] NodeAddr=X.X.X.[106-164] Sockets=2 CoresPerSocket=4
ThreadsPerCore=1 RealMemory=1 State=UNKNOWN Feature="par,rack1"
NodeName=hnmp[101-105] NodeAddr=X.X.X.[101-105] Sockets=2 CoresPerSocket=4
ThreadsPerCore=1 RealMemory=1 State=UNKNOWN Feature="pls"
#
#rack2
NodeName=hnmp[27-80] NodeAddr=X.X.X.[27-80] Procs=12 RealMemory=1
State=UNKNOWN Feature="par,rack2"
#
#rack3
NodeName=hnmp[5001-5056] NodeAddr=X.X.X.[1-56] Procs=16 RealMemory=1
State=UNKNOWN Feature="par,rack3"
#rack4
NodeName=hnmp[5057-5100] NodeAddr=X.X.X.[57-100] Procs=16 RealMemory=1
State=UNKNOWN Feature="par,rack4"
*Partitions Properties:*
#Partitions
#
# priority partitions
#
PartitionName=low Nodes=hnmp[106-164] Default=NO MaxTime=INFINITE State=UP
Shared=NO Priority=10 PreemptMode=requeue
PartitionName=hi Nodes=hnmp[106-164] Default=NO MaxTime=INFINITE State=UP
Shared=NO Priority=30 PreemptMode=off
PartitionName=med Nodes=hnmp[106-164] Default=NO MaxTime=INFINITE State=UP
Shared=NO Priority=20 PreemptMode=off
# lsdyna partition
PartitionName=lsall Nodes=hnmp[05-07,09-16] Default=NO MaxTime=INFINITE
State=UP Shared=NO Priority=10 PreemptMode=off
# Default partition
#
PartitionName=hnm
Nodes=hnmp[01-16,18-26,101-164,27-80,165-176,181-196,5001-5100] Default=YES
MaxTime=INFINITE State=UP Shared=NO Priority=20 PreemptMode=off
# Backfill partition
#
PartitionName=hpc Nodes=hnmp[101-164,27-80,5001-5100] MaxTime=7-0 State=UP
Shared=NO Priority=20 PreemptMode=off