Discussion:
Job migration from node to node or partition to partition
Sefa Arslan
2012-08-28 13:34:04 UTC
I want to migrate a running job from one node to another (or from one
partition to another partition).

I am using the following commands for this operation:
# scontrol update jobid=100002 partition=best
which returns "slurm_update error: Requested operation is presently disabled",

or

# scontrol update jobid=100002 nodelist=lufer118
which returns "slurm_update error: Invalid node name specified".


What may be the problem with my configuration? How should I change it? I
searched the web but could not find much on this.

We are using SLURM 2.4.1 on this test cluster.


The scontrol show config output is:
Configuration data as of 2012-08-28T16:19:31
AccountingStorageBackupHost = (null)
AccountingStorageEnforce = associations,limits
AccountingStorageHost = lufer110
AccountingStorageLoc = N/A
AccountingStoragePort = 6819
AccountingStorageType = accounting_storage/slurmdbd
AccountingStorageUser = N/A
AccountingStoreJobComment = YES
AuthType = auth/munge
BackupAddr = (null)
BackupController = (null)
BatchStartTimeout = 10 sec
BOOT_TIME = 2012-08-28T16:07:17
CacheGroups = 1
CheckpointType = checkpoint/none
ClusterName = linux
CompleteWait = 0 sec
ControlAddr = lufer110
ControlMachine = lufer110
CryptoType = crypto/munge
DebugFlags = (null)
DefMemPerNode = UNLIMITED
DisableRootJobs = NO
EnforcePartLimits = YES
Epilog = (null)
EpilogMsgTime = 2000 usec
EpilogSlurmctld = (null)
FastSchedule = 1
FirstJobId = 100000
GetEnvTimeout = 2 sec
GresTypes = (null)
GroupUpdateForce = 0
GroupUpdateTime = 600 sec
HASH_VAL = Match
HealthCheckInterval = 0 sec
HealthCheckProgram = (null)
InactiveLimit = 0 sec
JobAcctGatherFrequency = 30 sec
JobAcctGatherType = jobacct_gather/linux
JobCheckpointDir = /var/slurm/checkpoint
JobCompHost = localhost
JobCompLoc = /var/log/slurm/job_completions
JobCompPort = 0
JobCompType = jobcomp/filetxt
JobCompUser = root
JobCredentialPrivateKey = (null)
JobCredentialPublicCertificate = (null)
JobFileAppend = 0
JobRequeue = 1
JobSubmitPlugins = (null)
KillOnBadExit = 0
KillWait = 30 sec
Licenses = (null)
MailProg = /bin/mail
MaxJobCount = 1000000
MaxJobId = 4294901760
MaxMemPerNode = UNLIMITED
MaxStepCount = 40000
MaxTasksPerNode = 128
MessageTimeout = 10 sec
MinJobAge = 300 sec
MpiDefault = none
MpiParams = (null)
NEXT_JOB_ID = 100004
OverTimeLimit = 0 min
PluginDir = /usr/lib64/slurm
PlugStackConfig = /etc/slurm/plugstack.conf
PreemptMode = GANG
PreemptType = preempt/none
PriorityDecayHalfLife = 00:00:00
PriorityCalcPeriod = 00:05:00
PriorityFavorSmall = 0
PriorityFlags = 0
PriorityMaxAge = 14-00:00:00
PriorityUsageResetPeriod = NONE
PriorityType = priority/multifactor
PriorityWeightAge = 1000
PriorityWeightFairShare = 10000
PriorityWeightJobSize = 1000
PriorityWeightPartition = 1000
PriorityWeightQOS = 0
PrivateData = none
ProctrackType = proctrack/cgroup
Prolog = (null)
PrologSlurmctld = (null)
PropagatePrioProcess = 0
PropagateResourceLimits = (null)
PropagateResourceLimitsExcept = MEMLOCK
RebootProgram = (null)
ReconfigFlags = (null)
ResumeProgram = (null)
ResumeRate = 300 nodes/min
ResumeTimeout = 60 sec
ResvOverRun = 0 min
ReturnToService = 0
SallocDefaultCommand = (null)
SchedulerParameters = (null)
SchedulerPort = 7321
SchedulerRootFilter = 1
SchedulerTimeSlice = 30 sec
SchedulerType = sched/builtin
SelectType = select/cons_res
SelectTypeParameters = CR_CPU_MEMORY
SlurmUser = root(0)
SlurmctldDebug = debug3
SlurmctldLogFile = /var/log/slurm/slurmctld.log
SlurmSchedLogFile = (null)
SlurmctldPort = 6817
SlurmctldTimeout = 300 sec
SlurmdDebug = debug3
SlurmdLogFile = /var/log/slurm/slurmd.log
SlurmdPidFile = /var/run/slurmd.pid
SlurmdPort = 6818
SlurmdSpoolDir = /tmp/slurmd
SlurmdTimeout = 300 sec
SlurmdUser = root(0)
SlurmSchedLogLevel = 0
SlurmctldPidFile = /var/run/slurmctld.pid
SLURM_CONF = /etc/slurm/slurm.conf
SLURM_VERSION = 2.4.1
SrunEpilog = (null)
SrunProlog = (null)
StateSaveLocation = /home/slurm/slurm-lufer110.state
SuspendExcNodes = (null)
SuspendExcParts = (null)
SuspendProgram = (null)
SuspendRate = 60 nodes/min
SuspendTime = NONE
SuspendTimeout = 30 sec
SwitchType = switch/none
TaskEpilog = (null)
TaskPlugin = task/cgroup
TaskPluginParam = (null type)
TaskProlog = (null)
TmpFS = /tmp
TopologyPlugin = topology/none
TrackWCKey = 0
TreeWidth = 50
UsePam = 1
UnkillableStepProgram = (null)
UnkillableStepTimeout = 60 sec
VSizeFactor = 0 percent
WaitTime = 0 sec

Slurmctld(primary/backup) at lufer110/(NULL) are UP/DOWN
--
*Sefa ARSLAN*
Researcher
Network Technologies Unit
TUBITAK ULAKBIM
YOK Binasi B5 Blok Kat:3 Bilkent
06539 ANKARA
T +90 312 298 9397
F +90 312 298 9397
www.ulakbim.gov.tr <http://www.ulakbim.gov.tr/>
sefa.arslan-***@public.gmane.org
Disclaimer <http://www.tubitak.gov.tr/sorumlulukreddi>
Moe Jette
2012-08-28 16:05:03 UTC
You can checkpoint/restart a job or requeue it (which cancels the
original job). These only work for batch jobs.
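
For a batch job, a requeue-based move could look roughly like this (using
the job ID and partition name from your commands; the hold only keeps the
requeued job from being scheduled again before the partition is changed):

# scontrol hold 100002
# scontrol requeue 100002
# scontrol update jobid=100002 partition=best
# scontrol release 100002

The partition of a job can normally only be changed while it is pending,
which is why the update on the running job is rejected. With a checkpoint
plugin configured, the checkpoint path is roughly "scontrol checkpoint
create 100002" and, later, "scontrol checkpoint restart 100002".
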
Post by Sefa Arslan
I want to migrate a running job from one node to another (or from one
partition to another partition).
# scontrol update jobid=100002 partition=best
which returns "slurm_update error: Requested operation is presently disabled",
or
# scontrol update jobid=100002 nodelist=lufer118
which returns "slurm_update error: Invalid node name specified".
What may be the problem with my configuration? How should I change it?
I searched the web but could not find much on this.
We are using SLURM 2.4.1 on this test cluster.
Sefa Arslan
2012-08-31 13:32:04 UTC
I tried to checkpoint a job, but I get the following errors in my log
files, and the checkpoint files are not created:


Log from the controller (slurmctld):

[2012-08-31T15:45:36] debug: _checkpoint_job_record: checkpoint job record of 100079 to file /home_palamut1/slurm/checkpoint/100079.ckpt
[2012-08-31T15:45:36] checkpoint_op 3 of 100079.4294967294 complete, rc=0
[2012-08-31T15:45:36] debug3: checkpoint/blcr: sending checkpoint tasks request 3 to 100079.4294967294
[2012-08-31T15:45:36] debug2: Tree head got back 0 looking for 1
[2012-08-31T15:45:36] debug3: Tree sending to lufer113
[2012-08-31T15:45:36] debug2: Tree head got back 1
[2012-08-31T15:45:36] debug2: Tree head got them all
[2012-08-31T15:45:36] error: checkpoint/blcr: error on checkpoint request 3 to 100079.4294967294: Job step not running

Log from the node running the job (slurmd):

[2012-08-31T15:45:42] [100079] checkpoint/blcr: slurm_ckpt_signal_tasks: image_dir=/home_palamut1/sefa/checkpoint/100079
[2012-08-31T15:45:42] [100079] Error sending checkpoint request to 100079.4294967294: Unspecified error
[2012-08-31T15:45:42] [100079] Leaving _handle_request: SLURM_SUCCESS
[2012-08-31T15:45:42] [100079] Entering _handle_request
[2012-08-31T15:45:42] [100079] Leaving _handle_accept

My slurm.conf now contains the lines

JobCheckpointDir = /home_palamut1/slurm/checkpoint
CheckpointType = checkpoint/blcr

and my job file includes the lines

#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=/home_palamut1/sefa/checkpoint

mpirun ./count

(the full job file is shown below).
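
For reference, the whole job file is essentially the following (the job
name and size lines are placeholders, not my exact values):

#!/bin/bash
#SBATCH --job-name=count
#SBATCH --ntasks=8
#SBATCH --checkpoint=1
#SBATCH --checkpoint-dir=/home_palamut1/sefa/checkpoint

mpirun ./count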

Note 1: Previously I was getting an error along the lines of "debug:
checkpoint/blcr: file sbin/scch", since that file does not exist on my
system. I then created a placeholder script there that does nothing (just
"#!/bin/bash"), and after that I started getting the errors above.

Note 2: I am building SLURM with "rpmbuild -tb slurm-2.4.1.tar.bz2", and
the srun_cr command was not created.
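
As far as I understand, srun_cr and the checkpoint/blcr plugin are only
built when configure finds the BLCR development files on the build host,
so my next attempt is roughly the following (the BLCR package names are a
guess for our distribution):

# yum install blcr blcr-devel
# rpmbuild -tb slurm-2.4.1.tar.bz2
# for f in ~/rpmbuild/RPMS/x86_64/slurm-*.rpm; do rpm -qlp "$f"; done | grep -e srun_cr -e checkpoint_blcr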

Post by Moe Jette
You can checkpoint/restart a job or requeue it (which cancels the
original job). These only work for batch jobs.