Checkpointing is a technique that provides fault tolerance for computations. It consists of saving a snapshot of the application's state to persistent storage so that the application can restart from that point in case of failure. This is particularly important for long-running jobs, as they are more likely to fail than short-running jobs.

Besides fault tolerance, checkpointing can increase job throughput: jobs that request shorter run times start sooner on average than jobs that request long run times. The scheduling mechanism that implements this prioritization is called backfill.
Checkpointing is available on Gizmo as a beta feature. The feature is currently only available on the new Gizmo Bionic nodes that have Ubuntu 18.04 installed. The current checkpointing implementation is geared towards increasing throughput and improving fault tolerance.
How to Use Checkpointing
You can activate checkpointing by using the `checkpointer` command in the shell script that starts your job. After you launch `checkpointer` with your job, it waits in the background until the job's run time exceeds the checkpoint time, which is set in the environment variable `SLURM_CHECKPOINT`. It then kills your compute process, flushes its state to disk, and triggers a requeue of your Slurm job. When the requeued job starts, it loads all information from disk and continues the computation (likely on a different compute node).

To enable this, add the `checkpointer` command to your submission script and ensure it is executed before the actual compute script or binary, for example:
```
> cat runscript.sh

#!/bin/bash
#SBATCH --requeue
#SBATCH --open-mode=append

export SLURM_CHECKPOINT=0-0:45
checkpointer Rscript /fh/fast/..../script.R
```
Setting the environment variable `SLURM_CHECKPOINT=0-0:45` means that the job will save a checkpoint every 45 minutes. The directive `--requeue` is required for jobs that can be checkpointed and restarted multiple times, and `--open-mode=append` ensures that the sbatch output file (e.g. `slurm-123456.out`) is not truncated each time the job is restarted.
After this, launch the script with sbatch. Please ensure that `SLURM_CHECKPOINT` is smaller than the wall clock time (maximum run time) you request with the `-t` option. If you request 6 hours (e.g. `sbatch -t 0-6:00`) and set `SLURM_CHECKPOINT` to 45 minutes, your job may be restarted 7 times.
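The relationship between wall clock time and checkpoint time can be sketched with a bit of arithmetic. The helper below is purely illustrative and not part of `checkpointer`:

```python
# Estimate how many times a job will be checkpointed and requeued,
# given a wall clock limit and a checkpoint interval (both in minutes).
def expected_restarts(wall_clock_min, checkpoint_min):
    # A checkpoint/requeue happens after every full checkpoint interval
    # that completes before the wall clock limit is reached.
    intervals = wall_clock_min // checkpoint_min
    # The final interval finishes the job rather than restarting it.
    return max(intervals - 1, 0)

# 6 hours of wall clock time with a 45-minute checkpoint interval:
print(expected_restarts(360, 45))  # -> 7
```

This matches the example above: a 6-hour job with 45-minute checkpoints may be restarted 7 times.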
```
sbatch -o out.txt -t 0-6:00 runscript.sh
tail -f out.txt
```
A Simple Example
Create a simple Python script called `looper.py` and make it executable with `chmod +x looper.py`. The script simply counts to 100 and writes each iteration to a new line in a text file in the current folder:
```python
#! /usr/bin/env python3
import os, time, socket

outfile = 'looper.txt'
print('writing to %s ...' % outfile)
for i in range(1, 101):
    pid = os.getpid()
    ppid = os.getppid()
    line = "%s node:%s pid:%s parent-pid:%s" % (str(i), socket.gethostname(), pid, ppid)
    fh = open(outfile, 'a')
    fh.write(line + '\n')
    fh.flush()
    fh.close()
    print(line, flush=True)
    time.sleep(1)
```
Now add `looper.py` to your submission script and set your checkpoint time to 10 seconds:
```
> cat runscript.sh

#!/bin/bash
#SBATCH --requeue
#SBATCH --open-mode=append

export SLURM_CHECKPOINT=0-0:00:10
checkpointer ./looper.py
```
Run it and tail the output file:
```
sbatch -o out.txt -t 0-1:00 runscript.sh
tail -f out.txt
```
We see that `looper.py` runs and is then successfully checkpointed to disk and requeued:
```
### ****************************** #######
SLURM_RESTART_COUNT:
SLURMD_NODENAME: gizmok28
writing to looper.txt ...
1 node:gizmok28 pid:6862 parent-pid:6850
2 node:gizmok28 pid:6862 parent-pid:6850
.
.
60 node:gizmok28 pid:6862 parent-pid:6850
61 node:gizmok28 pid:6862 parent-pid:6850
62 node:gizmok28 pid:6862 parent-pid:6850
Dumping checkpoint for pid 6862 to /fh/scratch/delete10/_HDC/SciComp/.checkpointer/petersen/47503490/0
Dump done, Exit code: 0
Requeueing ... 47503490
slurmstepd-gizmok28: error: *** JOB 47503490 ON gizmok28 CANCELLED AT 2020-05-11T00:47:23 DUE TO JOB REQUEUE ***
```
A More Realistic Example with Local Scratch
In the example above, `looper.py` writes to the current folder, which is on a network share. Unfortunately, checkpointing does not support open file handles on network shares. This does not seem to be a problem at first, because `looper.py` opens and closes the output file in each loop iteration. But what if `looper.py` needed to keep a file handle open for longer? Let's look at a slightly modified version:
```python
#! /usr/bin/env python3
import os, time, socket

outfile = '%s/looper.txt' % os.environ['TMPDIR']
print('writing to %s ...' % outfile)
fh = open(outfile, 'a')
for i in range(1, 101):
    pid = os.getpid()
    ppid = os.getppid()
    line = "%s node:%s pid:%s parent-pid:%s" % (str(i), socket.gethostname(), pid, ppid)
    fh.write(line + '\n')
    fh.flush()
    print(line, flush=True)
    time.sleep(1)
fh.close()
```
In this case the file handle is open for almost the entire time the script runs. If we checkpoint while a file handle is open on a network location, checkpointing may fail in some cases. As a workaround, we can temporarily write the file to local scratch space. The root of this compute job's local scratch space is accessible via the environment variable `$TMPDIR`, so we write the output file there. Note that everything under `$TMPDIR` is deleted when the compute job ends.
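If you also want to run such a script outside of a Slurm job, where `TMPDIR` may not be set, one defensive pattern is to fall back to the current directory. This fallback is our suggestion and not part of `checkpointer` itself:

```python
import os

# Prefer the job's local scratch directory (Slurm sets TMPDIR for the job);
# fall back to the current directory when TMPDIR is not set, e.g. when
# testing the script interactively outside of a job.
scratch = os.environ.get('TMPDIR', '.')
outfile = os.path.join(scratch, 'looper.txt')
print('writing to %s ...' % outfile)
```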
But if `looper.py` writes to a local disk and that data goes away when the job ends, how can we ensure the data is not lost? If you set the environment variable `RESULT_FOLDER` to an existing network directory to which you have write permission, `checkpointer` will copy all local data to this network location after the job finishes. For example, use this command to set the result folder to the current working directory before you submit a job:
```
export RESULT_FOLDER=$(pwd)
sbatch -o out.txt -t 0-1:00 runscript.sh
tail -f out.txt
```
Note: another benefit of local scratch space is its very high performance. The space under `/loc` allows for up to 1.5 GB/s throughput.
Checkpointing can greatly improve job throughput because you can reduce your wall clock time, which allows the cluster to start your jobs much sooner. How does wall clock time relate to checkpoint time (i.e. `SLURM_CHECKPOINT`)? The wall clock time needs to be longer than or equal to the checkpoint time. If you do not set the checkpoint time, `checkpointer` will simply use the wall clock time as the checkpoint time and checkpoint jobs 10 minutes before the wall clock time ends. A big question is how often one should checkpoint and how many checkpoints a single compute job should have. There are some dependencies:
- Memory: Large memory jobs with several GB of memory utilization take longer to checkpoint, as all information in memory needs to be written to disk. Assuming a modest 100 MB/s throughput, a job using 20 GB will take a little over 3 minutes to flush to network storage.
- Disk Space: Each checkpointed job copies data to a network scratch location (currently under delete10). This data includes the content of memory as well as all output files under `$TMPDIR`.
- Slurm Requeue: A requeued job goes back into the queue and might have to wait, even though it has a high priority given its low wall clock time. After a job is requeued, Slurm waits at least 60 seconds before the job can be launched again.
There are also some limitations:

- `checkpointer` supports only simple jobs that run on a single node.
- The submission script should not contain complex structures or multiple steps.
- Error recovery is only partially tested. `checkpointer` will make multiple attempts to recover from failures, but much more testing needs to be done.
- To simplify debugging, `checkpointer` executes the first checkpoint no later than 60 seconds after job start, so you always get immediate feedback on whether checkpointing works.
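The memory consideration above is simple division: memory image size over storage throughput. A quick sketch, using the assumed (not measured) 100 MB/s figure from above:

```python
# Rough estimate of checkpoint flush time: memory image size divided by
# storage throughput. 100 MB/s is the modest network-storage assumption
# used above; local scratch under /loc can sustain far higher rates.
def flush_time_seconds(mem_gb, throughput_mb_per_s=100):
    return mem_gb * 1024 / throughput_mb_per_s

# A job using 20 GB of memory at 100 MB/s takes a little over 3 minutes:
print(round(flush_time_seconds(20) / 60, 1))  # -> 3.4
```

This is worth weighing against the checkpoint interval: if flushing takes several minutes, very short checkpoint intervals waste a large fraction of the job's run time.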
We have tested this process with R and Python and would like to continue testing with other tools. Please contact SciComp to request assistance and discuss which environment is best for your needs.
Updated: May 14, 2020