
How to get the global_step when restoring checkpoints in Tensorflow?

Tags:

tensorflow

I'm saving my session state like so:

self._saver = tf.train.Saver()
self._saver.save(self._session, '/network', global_step=self._time)

When I later restore I want to get the value of the global_step for the checkpoint I restore from. This is in order to set some hyper parameters from it.

The hacky way to do this would be to parse the file names in the checkpoint directory. But surely there has to be a better, built-in way to do this?

asked Mar 20 '16 by Daniel Slater

People also ask

How do I restore a saved model in TensorFlow?

The first thing to do when restoring a TensorFlow model is to load the graph structure from the ".meta" file into the current graph. The current graph can then be explored using tf.get_default_graph().

What is Ckpt file in TensorFlow?

A .ckpt file is a binary file containing the values of the weights, biases, and all the other variables that were saved. Note that TensorFlow changed this format starting with version 0.11.

What does TF train saver do?

The Saver class adds ops to save and restore variables to and from checkpoints. It also provides convenience methods to run these ops. Checkpoints are binary files in a proprietary format which map variable names to tensor values. The best way to examine the contents of a checkpoint is to load it using a Saver.


2 Answers

The general pattern is to keep a global_step variable in the graph to track training steps:

global_step = tf.Variable(0, name='global_step', trainable=False)
train_op = optimizer.minimize(loss, global_step=global_step)

Then you can save with

saver.save(sess, save_path, global_step=global_step)

When you restore the checkpoint, the value of global_step is restored along with the other variables.

answered Oct 18 '22 by Yaroslav Bulatov


This is a bit of a hack, but the other answers did not work for me at all

import os

ckpt = tf.train.get_checkpoint_state(checkpoint_dir)

# Extract the step from the checkpoint filename, e.g. 'network-1200' -> 1200
step = int(os.path.basename(ckpt.model_checkpoint_path).split('-')[1])
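The parsing itself is plain string handling; for a checkpoint saved as, say, network-1200 (path and step number made up for illustration):

```python
import os

# Illustrative value of ckpt.model_checkpoint_path
model_checkpoint_path = '/tmp/ckpt_demo/network-1200'

# basename -> 'network-1200', split on '-' -> ['network', '1200']
step = int(os.path.basename(model_checkpoint_path).split('-')[1])
print(step)  # 1200
```

Note this assumes the checkpoint prefix itself contains no hyphens; otherwise the split index would need adjusting.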

Update 9/2017

I'm not sure if this started working due to updates, but the following method seems to be effective in getting global_step to update and load properly:

Create two things: a variable to hold global_step and an op to increment it:

global_step = tf.Variable(0, trainable=False, name='global_step')
increment_global_step = tf.assign_add(global_step, 1,
                                      name='increment_global_step')

Now in your training loop, run the increment op every time you run your training op:

sess.run([train_op, increment_global_step], feed_dict=feed_dict)

If you ever want to retrieve your global step value as an integer at any point, just use the following command after loading the model:

sess.run(global_step)

This can be useful for creating filenames or computing the current epoch without keeping a second TensorFlow variable to hold that value. For instance, since global_step counts batches, calculating the current epoch on loading would be something like:

loaded_epoch = sess.run(global_step) * batch_size // num_train_records
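A small worked example of that arithmetic with made-up numbers (since global_step counts batches, multiply by batch_size before dividing by the number of training records):

```python
# Made-up numbers for illustration
restored_step = 12500  # value read back with sess.run(global_step)
batch_size = 32
num_train_records = 10000

examples_seen = restored_step * batch_size  # 400,000 examples processed
loaded_epoch = examples_seen // num_train_records
print(loaded_epoch)  # 40
```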

answered Oct 18 '22 by Lawrence Du