Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to checkpoint a long-running function pythonically?

The typical situation in computational sciences is to have a program that runs for several days/weeks/months straight. As hardware/OS failures are inevitable, one typically utilize checkpointing, i.e. saves the state of the program from time to time. In case of failure, one restarts from the latest checkpoint.

What is the pythonic way to implement checkpointing?

For example, one can dump function's variables directly.

Alternatively, I am thinking of transforming such function into a class (see below). Arguments of the function would become arguments of a constructor. Intermediate data that constitute state of the algorithm would become class attributes. And pickle module would help with the (de-)serialization.

import pickle

# The file with checkpointing data
chkpt_fname = 'pickle.checkpoint'

class Factorial:
    def __init__(self, n):
        # Arguments of the algorithm
        self.n = n

        # Intermediate data (state of the algorithm)
        self.prod = 1
        self.begin = 0

    def get(self, need_restart):
        # Last time the function crashed. Need to restore the state.
        if need_restart:
            with open(chkpt_fname, 'rb') as f:
                self = pickle.load(f)

        for i in range(self.begin, self.n):
            # Some computations
            self.prod *= (i + 1)
            self.begin = i + 1

            # Some part of the computations is completed. Save the state.
            with open(chkpt_fname, 'wb') as f:
                pickle.dump(self, f)

            # Artificial failure of the hardware/OS/Ctrl-C/etc.
            if (not need_restart) and (i == 3):
                return

        return self.prod


if __name__ == '__main__':
    f = Factorial(6)
    print(f.get(need_restart=False))
    print(f.get(need_restart=True))
like image 312
Alexander Pozdneev Avatar asked Dec 08 '15 12:12

Alexander Pozdneev


1 Answers

Usually the answer is serialize with your favourite serialization method be that cpickle json or xml. Pickle has the advantage that you can deserialize a whole object without much extra work.

Additionally it's a good idea to separate your process from your state, so you simply serialize your state object. Lots of objects can't be pickled for example threads, but you may want to run many workers(although beware of the GIL), so pickling will throw an exception of you try to pickle them. You can work around this with _getstate_ and _setstate_ deleting entries which cause a problem- but if you just keep process and state separate this is no longer a problem.

To checkpoint, save your checkpoint file to a known location, when your program begins, check if this file exists, if it doesn't the process hasn't started, otherwise load and run it. Create a thread that periodically checkpoints your running task by draining a queue any worker threads are processing and then saving your state object, then reuse the resume logic you use in case to resume from after checkpointing.

To safely checkpoint you need to insure your program doesn't corrupt the checkpoint file by dying mid pickle. To do this

  1. Create checkpoint at temporary location and finish writing
  2. Rename last checkpoint to checkpoint.old
  3. Rename new checkpoint to checkpoint.pickle
  4. Remove checkpoint.old
  5. If you start the program and there is no checkpoint but there is a checkpoint.old, then the program died after step 2, so load checkpoint.old, rename checkpoint.old to checkpoint.pickle and run as normal. If the program died anywhere else, you can simply reload checkpoint.pickle.
like image 168
awiebe Avatar answered Oct 22 '22 13:10

awiebe