
TensorFlow: missing checkpoint files. Does the Saver only keep 5 checkpoints?

I am working with TensorFlow and have been training some models, saving them after each epoch with tf.train.Saver. Saving and loading both work fine, and I am doing this in the usual way:

import tensorflow as tf  # TensorFlow 1.x API

with tf.Graph().as_default(), tf.Session() as session:
    initialiser = tf.random_normal_initializer(config.mean, config.std)

    with tf.variable_scope("model", reuse=None, initializer=initialiser):
        m = a2p(session, config, training=True)

    # Default Saver; restore the latest checkpoint if one exists.
    saver = tf.train.Saver()
    ckpt = tf.train.get_checkpoint_state(model_dir)
    if ckpt and tf.gfile.Exists(ckpt.model_checkpoint_path):
        saver.restore(session, ckpt.model_checkpoint_path)
    ...
    for i in range(epochs):
        runepoch()
        # Save one checkpoint per epoch, named by the epoch index.
        save_path = saver.save(session, '%s.ckpt' % i)

My code is set up to save a checkpoint for each epoch, labelled with the epoch number. However, after fifteen epochs of training I only have checkpoint files for the last five epochs (10, 11, 12, 13, 14). The documentation doesn't say anything about this, so I am at a loss as to why it is happening.

Does the Saver only keep five checkpoints, or have I done something wrong?

Is there a way to make sure that all of the checkpoints are kept?

Eli asked Jul 08 '16 11:07


2 Answers

You can choose how many checkpoints to keep when you create your Saver object by setting the max_to_keep argument, which defaults to 5:

saver = tf.train.Saver(max_to_keep=10000)
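
To see why only epochs 10 to 14 survive in the question, here is a minimal pure-Python sketch of a "keep the most recent N" policy. It is an illustration only, not TensorFlow's implementation: the function name surviving_checkpoints is hypothetical, and treating None or 0 as "never delete" is an assumption based on my reading of the Saver documentation.

```python
from collections import deque


def surviving_checkpoints(num_epochs, max_to_keep=5):
    """Simulate which per-epoch checkpoints remain (illustrative only)."""
    # Assumption: None or 0 means nothing is ever deleted.
    if not max_to_keep:
        return ['%s.ckpt' % i for i in range(num_epochs)]
    # A bounded deque drops its oldest entry once it is full,
    # mimicking the deletion of the oldest checkpoint.
    kept = deque(maxlen=max_to_keep)
    for i in range(num_epochs):
        kept.append('%s.ckpt' % i)
    return list(kept)


print(surviving_checkpoints(15))
# ['10.ckpt', '11.ckpt', '12.ckpt', '13.ckpt', '14.ckpt']
```

With fifteen epochs and the default of 5, the first ten checkpoints are rotated out, which matches what the question describes.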
Styrke answered Oct 23 '22 01:10


Setting max_to_keep=None makes the Saver keep all checkpoints. For example:

saver = tf.train.Saver(max_to_keep=None)
Rajarshee Mitra answered Oct 23 '22 02:10