Trey Dockendorf
2014-08-12 01:36:40 UTC
I'm attempting to configure PreemptMode=CHECKPOINT using BLCR. When a job is preempted with CHECKPOINT, the preemption fails with return code 2017 (ESLURM_INVALID_JOB_ID?), which seems like a bug.
My Preemptable job = 6196
My Preempting job = 6200
Below [1] are the log entries from when 6196 began to be preempted, up to the last mention of that JobID.
I am able to checkpoint the jobs just fine and restart them manually via scontrol. However, the jobs are not requeued automatically after being preempted. Is this a bug or some flaw in my setup? Let me know what other information would be useful to debug this. If anyone is successfully using PreemptMode=CHECKPOINT with BLCR (or some other application), I'd be grateful to know.
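For reference, the manual checkpoint and restart that do work were run roughly like this (a sketch using scontrol's checkpoint subcommands, with job 6196 as the example ID; exact options may differ):
# checkpoint the running batch job, leaving it running
scontrol checkpoint create 6196
# or checkpoint it and terminate it in one step
scontrol checkpoint vacate 6196
# later, restart the job from its checkpoint image
scontrol checkpoint restart 6196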
Here are my relevant slurm.conf parameters (a rough sketch of the QOS side of the preemption setup follows the list):
CheckpointType=checkpoint/blcr
PreemptMode=CHECKPOINT
PreemptType=preempt/qos
ProctrackType=proctrack/cgroup
SchedulerType=sched/backfill
SelectType=select/cons_res
SelectTypeParameters=CR_CPU_Memory,CR_CORE_DEFAULT_DIST_BLOCK
TaskPlugin=task/cgroup
TaskPluginParam=Sched
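The QOS preemption relationship itself lives in the accounting database rather than slurm.conf; it was set up along these lines (the QOS names below are illustrative, not necessarily the exact ones in use here):
# low-priority, preemptable QOS
sacctmgr add qos background
# higher-priority QOS that is allowed to preempt the background QOS
sacctmgr add qos serial
sacctmgr modify qos serial set Preempt=background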
I'm also seeing the following when executing a manual checkpoint using scontrol:
[2014-08-11T20:24:32.591] debug: _slurm_recv_timeout at 0 of 4, timeout
[2014-08-11T20:24:32.592] error: slurm_receive_msgs: Socket timed out on send/recv operation
[2014-08-11T20:24:32.602] debug3: problems with c0101
[2014-08-11T20:24:32.602] debug2: Tree head got back 1
[2014-08-11T20:24:32.602] error: checkpoint/blcr: error on checkpoint request 3 to 6196.4294967294: Communication connection failure
Thanks,
- Trey
[1]:
slurmctld.log
[2014-08-11T20:26:04.467] debug2: select_p_job_test for job 6200
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0101 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0102 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0103 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0104 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _rm_job_from_res: job 6196 action 0
[2014-08-11T20:26:04.467] debug3: cons_res: removed job 6196 from part background row 0
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0102 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0103 no mem 0 < 15900
[2014-08-11T20:26:04.467] debug3: cons_res: _vns: node c0104 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0101 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _rm_job_from_res: job 6196 action 0
[2014-08-11T20:26:04.468] debug3: cons_res: removed job 6196 from part background row 0
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0102 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0103 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0104 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0101 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _rm_job_from_res: job 6196 action 0
[2014-08-11T20:26:04.468] debug3: cons_res: removed job 6196 from part background row 0
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0102 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0103 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: cons_res: _vns: node c0104 no mem 0 < 15900
[2014-08-11T20:26:04.468] debug3: acct_policy_job_runnable_post_select: job 6200: MPN: job_memory set to 15900
[2014-08-11T20:26:04.468] debug: _checkpoint_job_record: checkpoint job record of 6196 to file /var/lib/slurm/checkpoint/6196.ckpt
[2014-08-11T20:26:04.512] checkpoint_op 7 of 6196.0 complete, rc=2017
[2014-08-11T20:26:04.512] debug2: _adjust_limit_usage: job 6196: MPN: job_memory set to 15900
[2014-08-11T20:26:04.512] debug2: _adjust_limit_usage: job 6196: MPN: job_memory set to 15900
[2014-08-11T20:26:04.512] debug3: cons_res: _rm_job_from_res: job 6196 action 0
[2014-08-11T20:26:04.512] debug3: cons_res: removed job 6196 from part background row 0
[2014-08-11T20:26:04.513] debug2: Spawning RPC agent for msg_type REQUEST_KILL_PREEMPTED
[2014-08-11T20:26:04.513] job_signal 9 of running job 6196 successful
[2014-08-11T20:26:04.513] preempted job 6196 had to be killed
[2014-08-11T20:26:04.513] debug3: sched: JobId=6200. State=PENDING. Reason=Resources. Priority=34999669. Partition=serial.
[2014-08-11T20:26:04.513] debug2: got 1 threads to send out
[2014-08-11T20:26:04.513] debug2: Tree head got back 0 looking for 1
[2014-08-11T20:26:04.513] debug3: Tree sending to c0101
[2014-08-11T20:26:04.515] debug2: Tree head got back 1
[2014-08-11T20:26:04.518] debug2: node_did_resp c0101
[2014-08-11T20:26:05.605] debug2: Processing RPC: REQUEST_COMPLETE_BATCH_SCRIPT from uid=0 JobId=6196
[2014-08-11T20:26:05.606] completing job 6196 status 15
[2014-08-11T20:26:05.606] debug2: _slurm_rpc_complete_batch_script JobId=6196: Job/step already completing or completed
[2014-08-11T20:26:05.679] debug2: Processing RPC: MESSAGE_EPILOG_COMPLETE uid=0
[2014-08-11T20:26:05.679] debug2: _slurm_rpc_epilog_complete JobId=6196 Node=c0101 usec=85
slurmd.log (c0101):
[2014-08-11T20:26:04.471] debug3: in the service_connection
[2014-08-11T20:26:04.471] debug2: got this type of message 6016
[2014-08-11T20:26:04.471] debug2: Processing RPC: REQUEST_KILL_PREEMPTED
[2014-08-11T20:26:04.472] [6196] Called _msg_socket_accept
[2014-08-11T20:26:04.472] [6196] Leaving _msg_socket_accept
[2014-08-11T20:26:04.472] [6196] Called _msg_socket_readable
[2014-08-11T20:26:04.485] [6196] Entering _handle_accept (new thread)
[2014-08-11T20:26:04.485] [6196] Identity: uid=0, gid=0
[2014-08-11T20:26:04.485] [6196] Entering _handle_request
[2014-08-11T20:26:04.485] debug2: container signal 994 to job 6196.4294967294
[2014-08-11T20:26:04.485] [6196] Got request 4
[2014-08-11T20:26:04.485] [6196] Handling REQUEST_SIGNAL_CONTAINER
[2014-08-11T20:26:04.485] [6196] _handle_signal_container for step=6196.4294967294 uid=0 signal=994
[2014-08-11T20:26:04.486] [6196] *** JOB 6196 CANCELLED AT 2014-08-11T20:26:04 DUE TO PREEMPTION ***
[2014-08-11T20:26:04.486] [6196] Leaving _handle_request: SLURM_SUCCESS
[2014-08-11T20:26:04.486] [6196] Entering _handle_request
[2014-08-11T20:26:04.486] [6196] Leaving _handle_accept
[2014-08-11T20:26:04.486] debug2: No steps in jobid 6196 to send signal 15
[2014-08-11T20:26:04.486] Job 6196: timeout: sent SIGTERM to 0 active steps
[2014-08-11T20:26:04.486] debug: _rpc_terminate_job, uid = 1469
[2014-08-11T20:26:04.486] debug3: state for jobid 5801: ctime:1407772478 revoked:0 expires:0
[2014-08-11T20:26:04.486] debug3: state for jobid 5994: ctime:1407782947 revoked:0 expires:0
[2014-08-11T20:26:04.486] debug3: state for jobid 5995: ctime:1407782961 revoked:0 expires:0
[2014-08-11T20:26:04.486] debug3: state for jobid 5996: ctime:1407783020 revoked:0 expires:0
[2014-08-11T20:26:04.486] debug3: state for jobid 6106: ctime:1407796252 revoked:0 expires:0
[2014-08-11T20:26:04.486] debug3: state for jobid 6186: ctime:1407805427 revoked:1407805457 expires:1407805457
[2014-08-11T20:26:04.486] debug3: destroying job 6186 state
[2014-08-11T20:26:04.486] debug3: state for jobid 6187: ctime:1407805484 revoked:1407805810 expires:1407805810
[2014-08-11T20:26:04.486] debug3: state for jobid 6195: ctime:1407805985 revoked:1407806153 expires:1407806153
[2014-08-11T20:26:04.486] debug3: state for jobid 6191: ctime:1407806429 revoked:1407806429 expires:1407806429
[2014-08-11T20:26:04.487] debug3: state for jobid 6196: ctime:1407806648 revoked:0 expires:0
[2014-08-11T20:26:04.487] debug: credential for job 6196 revoked
[2014-08-11T20:26:04.487] [6196] Called _msg_socket_accept
[2014-08-11T20:26:04.487] [6196] Leaving _msg_socket_accept
[2014-08-11T20:26:04.487] [6196] Called _msg_socket_readable
[2014-08-11T20:26:04.495] [6196] Entering _handle_accept (new thread)
[2014-08-11T20:26:04.504] [6196] Identity: uid=0, gid=0
[2014-08-11T20:26:04.504] [6196] Entering _handle_request
[2014-08-11T20:26:04.504] debug2: container signal 18 to job 6196.4294967294
[2014-08-11T20:26:04.504] [6196] Got request 4
[2014-08-11T20:26:04.504] [6196] Handling REQUEST_SIGNAL_CONTAINER
[2014-08-11T20:26:04.504] [6196] _handle_signal_container for step=6196.4294967294 uid=0 signal=18
[2014-08-11T20:26:04.504] [6196] killing process 9298 (slurm_task) with signal 18
[2014-08-11T20:26:04.504] [6196] parameter 'freezer.state' set to 'THAWED' for '/cgroup/freezer/slurm/uid_1380/job_6196/step_batch'
[2014-08-11T20:26:04.504] [6196] Sent signal 18 to 6196.4294967294
=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org