Trey Dockendorf
2014-08-05 17:10:32 UTC
I have found that in order to support SUSPEND preemption we cannot use CR_Memory or Memory as a consumable resource. I've seen that if a job in a preemptable partition has requested 15900MB of RAM on a 16GB node, then that job will not be preempted, and understandably so. Now I'm looking at how to implement preemption using checkpointing (PreemptMode=CHECKPOINT). However, I'm unable to find any documentation on the exact behavior, the configuration, or the necessary packages.
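To make the memory arithmetic concrete, here is a rough sketch of why the SUSPEND case fails. It is not SLURM's actual scheduling logic. It assumes that a suspended job keeps its memory allocation when memory is tracked as a consumable resource, and the function name and the 1024MB preempting job are hypothetical, chosen only for illustration.

```python
def fits_after_suspend(node_mb, suspended_job_mb, new_job_mb):
    """Return True if a new job's memory request fits on the node
    while a suspended job still holds its allocation (assumption)."""
    # Suspending does not release memory, so only the remainder is free.
    free_mb = node_mb - suspended_job_mb
    return new_job_mb <= free_mb


NODE_MB = 16 * 1024        # 16GB node
PREEMPTABLE_JOB_MB = 15900  # request of the preemptable job

# A hypothetical 1024MB high-priority job cannot start,
# because only 484MB remain free after suspending.
print(fits_after_suspend(NODE_MB, PREEMPTABLE_JOB_MB, 1024))
```

Under these assumptions, only a preempting job that asks for 484MB or less would fit alongside the suspended one.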
I have rebuilt the BLCR SRPM for my cluster, and am unsure which packages are necessary for the various systems. I have the SLURM controller, SLURM compute nodes and SLURM submit hosts (login nodes) that do not run the slurm daemon but only submit jobs.
I'm also unsure what the expected behavior is when a job is preempted and checkpointed. Will the job's state be saved? The documentation mentions ImageDir but does not say how it is set outside of interactive scontrol commands. If I enable PreemptMode=CHECKPOINT, I'm not clear on what the expected behavior will be for a user's job.
Any guidance on how other sites have implemented BLCR checkpointing, along with your experiences, would be useful.
Thanks,
- Trey
=============================
Trey Dockendorf
Systems Analyst I
Texas A&M University
Academy for Advanced Telecommunications and Learning Technologies
Phone: (979)458-2396
Email: treydock-mRW4Vj+***@public.gmane.org
Jabber: treydock-mRW4Vj+***@public.gmane.org