A typical situation in the computational sciences is a program that runs for several days, weeks, or months straight. As hardware and OS failures are inevitable, one typically uses checkpointing, i.e. saves the state of the program from time to time. In case of failure, one restarts from the latest checkpoint.
What is the pythonic way to implement checkpointing?
For example, one can dump a function's variables directly.
Alternatively, I am thinking of transforming such a function into a class (see below). Arguments of the function would become arguments of the constructor. Intermediate data that constitute the state of the algorithm would become instance attributes, and the pickle module would help with the (de-)serialization.
import pickle

# The file with checkpointing data
chkpt_fname = 'pickle.checkpoint'

class Factorial:
    def __init__(self, n):
        # Arguments of the algorithm
        self.n = n
        # Intermediate data (state of the algorithm)
        self.prod = 1
        self.begin = 0

    def get(self, need_restart):
        # Last time the function crashed. Need to restore the state.
        if need_restart:
            with open(chkpt_fname, 'rb') as f:
                self = pickle.load(f)
        for i in range(self.begin, self.n):
            # Some computations
            self.prod *= (i + 1)
            self.begin = i + 1
            # Some part of the computations is completed. Save the state.
            with open(chkpt_fname, 'wb') as f:
                pickle.dump(self, f)
            # Artificial failure of the hardware/OS/Ctrl-C/etc.
            if (not need_restart) and (i == 3):
                return
        return self.prod

if __name__ == '__main__':
    f = Factorial(6)
    print(f.get(need_restart=False))
    print(f.get(need_restart=True))
Usually the answer is to serialize with your favourite serialization method, be that cPickle, JSON, or XML. Pickle has the advantage that you can deserialize a whole object without much extra work.
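To make the trade-off concrete, here is a minimal sketch comparing pickle and JSON on the same state. The state dictionary and its keys are hypothetical; the set is included only because it is a type that JSON cannot represent directly.

```python
import json
import pickle

# Hypothetical algorithm state; the set is a type JSON has no equivalent for.
state = {"prod": 24, "begin": 4, "seen": {1, 2, 3}}

# pickle round-trips the object as-is, with no conversion code.
restored = pickle.loads(pickle.dumps(state))

# json.dumps raises TypeError for types it does not know, such as set.
try:
    json.dumps(state)
    json_ok = True
except TypeError:
    json_ok = False

# With JSON, the conversion has to be written by hand in both directions.
as_json = json.dumps({**state, "seen": sorted(state["seen"])})
from_json = json.loads(as_json)
from_json["seen"] = set(from_json["seen"])
```

JSON has the benefit of being readable and language-neutral, so the extra conversion code may still be worth it for simple state.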
Additionally, it's a good idea to separate your process from your state, so that you simply serialize your state object. Lots of objects can't be pickled, for example threads, but you may want to run many workers (although beware of the GIL), so pickling will throw an exception if you try to pickle them. You can work around this with __getstate__ and __setstate__, deleting the entries which cause a problem, but if you just keep process and state separate this is no longer a problem.
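A minimal sketch of the __getstate__ / __setstate__ workaround, assuming a hypothetical Worker class that holds a threading.Lock next to its picklable data (the class and attribute names are made up for illustration):

```python
import pickle
import threading


class Worker:
    """Hypothetical task mixing picklable state with an unpicklable lock."""

    def __init__(self, n):
        self.n = n
        self.prod = 1
        self.begin = 0
        self._lock = threading.Lock()  # locks cannot be pickled

    def __getstate__(self):
        # Copy the instance dict and drop the entry pickle cannot handle.
        state = self.__dict__.copy()
        del state["_lock"]
        return state

    def __setstate__(self, state):
        # Restore the saved attributes and create a fresh lock.
        self.__dict__.update(state)
        self._lock = threading.Lock()
```

If the lock and any threads lived in a separate object and only n, prod and begin were kept in a plain state object, neither method would be needed.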
To checkpoint, save your checkpoint file to a known location. When your program begins, check whether this file exists: if it doesn't, the process hasn't started yet; otherwise load it and resume. Create a thread that periodically checkpoints your running task by draining the queue that any worker threads are processing and then saving your state object. After each checkpoint, continue with the same resume logic you use when restarting after a failure.
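The startup check could look roughly like this. It is a sketch under simplifying assumptions: the state is a plain dictionary, the function names and the stop_after parameter (used to simulate a crash) are hypothetical, and the checkpoint is written after every iteration rather than from a separate thread.

```python
import os
import pickle


def load_or_init(path, n):
    """Return the saved state if a checkpoint exists, else a fresh state."""
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    return {"n": n, "prod": 1, "begin": 0}


def run(path, n, stop_after=None):
    """Compute n! and checkpoint the state after each iteration."""
    state = load_or_init(path, n)
    for i in range(state["begin"], state["n"]):
        state["prod"] *= i + 1
        state["begin"] = i + 1
        with open(path, "wb") as f:
            pickle.dump(state, f)
        # Simulated failure: give up once stop_after iterations are done.
        if stop_after is not None and state["begin"] == stop_after:
            return None
    return state["prod"]
```

Calling run a second time with the same path picks up from the saved value of begin, so the caller does not need a separate restart flag.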
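To safely checkpoint you need to ensure your program doesn't corrupt the checkpoint file by dying mid-pickle. To do this:
1. Create your new checkpoint at a temporary location and finish writing it.
2. Rename the last checkpoint to checkpoint.old.
3. Rename your new checkpoint to checkpoint.pickle.
4. Remove checkpoint.old.
5. If you start your program and there is no checkpoint.pickle but there is a checkpoint.old, then the program died after step 2, so load checkpoint.old, rename checkpoint.old to checkpoint.pickle and run as normal. If the program died anywhere else, you can simply reload checkpoint.pickle.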
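The rename procedure can be sketched as follows. The file name checkpoint.tmp and the function names are assumptions for illustration; os.replace is used for the renames because it overwrites an existing destination.

```python
import os
import pickle

CHECKPOINT = "checkpoint.pickle"
OLD = "checkpoint.old"
TMP = "checkpoint.tmp"  # hypothetical name for the temporary file


def save_checkpoint(state, directory="."):
    new = os.path.join(directory, CHECKPOINT)
    old = os.path.join(directory, OLD)
    tmp = os.path.join(directory, TMP)
    # Write the new checkpoint to a temporary file and flush it to disk.
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    # Move the previous checkpoint, if any, out of the way.
    if os.path.exists(new):
        os.replace(new, old)
    # Promote the temporary file to be the current checkpoint.
    os.replace(tmp, new)
    # The previous checkpoint is no longer needed.
    if os.path.exists(old):
        os.remove(old)


def load_checkpoint(directory="."):
    new = os.path.join(directory, CHECKPOINT)
    old = os.path.join(directory, OLD)
    # Only the old file exists: the last save died between the two renames.
    if not os.path.exists(new) and os.path.exists(old):
        os.replace(old, new)
    if not os.path.exists(new):
        return None  # nothing to resume from
    with open(new, "rb") as f:
        return pickle.load(f)
```

A caller would treat a None result from load_checkpoint as "start from scratch". A leftover checkpoint.tmp from a crash during the write is simply overwritten by the next save.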