
How to Pause / Resume Training in Tensorflow

Tags:

tensorflow

This question was asked before the documentation for save and restore was available. I now consider it deprecated and advise people to rely on the official documentation on Save and Restore.

Gist of old question:

I got TF working fine for the CIFAR tutorial. I've changed the code to save the train_dir (the directory with the checkpoint and models) to a known location.

Which brings me to my question: how can I pause and resume training with TF?

OddNorg asked Nov 13 '15 09:11


2 Answers

TensorFlow represents a computation as a graph: the nodes are operations (ops), the edges carry tensors between them, and Variables hold the state that persists between runs. It provides a Saver for saving and restoring those variables.

Because the computation is a graph, you can run part of it on one machine or processor and the rest on another. In the same way, you can save the state (the variables) at any point and load it back later to continue your work.

    saver.save(sess, 'my-model', global_step=0)     # ==> filename: 'my-model-0'
    ...
    saver.save(sess, 'my-model', global_step=1000)  # ==> filename: 'my-model-1000'

Later, you can call restore on the same kind of Saver object (an instance of tf.train.Saver):

    saver.restore(sess, save_path)

to restore your saved variables.

Saver Usage
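For illustration, the save and restore calls described in this answer can be put together in a small round trip. This is only a sketch under assumptions: it uses the TF 1.x-style graph API through `tensorflow.compat.v1`, and the `counter` variable, the `save_and_restore` helper, and the three-step loop are hypothetical stand-ins for a real model and training loop.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

# Assumption: TF 1.x-style graphs and sessions, reached via the compat module.
tf.disable_eager_execution()


def save_and_restore(ckpt_dir):
    """Save a variable after 3 updates, then restore it in a fresh session."""
    prefix = os.path.join(ckpt_dir, "my-model")  # checkpoint filename prefix
    with tf.Graph().as_default():
        # Hypothetical state standing in for model weights.
        counter = tf.get_variable(
            "counter", shape=[], initializer=tf.zeros_initializer())
        increment = tf.assign_add(counter, 1.0)
        saver = tf.train.Saver()

        # "Pause": run a few steps, then write the variables to disk.
        with tf.Session() as sess:
            sess.run(tf.global_variables_initializer())
            for _ in range(3):
                sess.run(increment)
            saved_path = saver.save(sess, prefix, global_step=3)

        # "Resume": a new session starts with uninitialized variables;
        # restore fills them in from the newest checkpoint in ckpt_dir.
        with tf.Session() as sess:
            saver.restore(sess, tf.train.latest_checkpoint(ckpt_dir))
            return saved_path, float(sess.run(counter))
```

Calling `save_and_restore(tempfile.mkdtemp())` returns the path of the written checkpoint and the restored counter value. Note that no initializer is run in the second session, since `restore` assigns every saved variable.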

Hamed MP answered Nov 20 '22 11:11


As described by Hamed, the right way to do it in TensorFlow is

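    import tensorflow as tf

    # `session` is an existing tf.Session for the graph being trained.
    saver = tf.train.Saver()
    save_path = 'checkpoints/'  # used as the checkpoint filename prefix
    # While training, store the current variable values:
    saver.save(sess=session, save_path=save_path)
    # Later, restore them into a session:
    saver.restore(sess=session, save_path=save_path)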

This loads the model as it was when you last saved it, so training can continue from that point if you want.
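A resumable training loop built on this idea might look like the sketch below. It rests on several assumptions: the TF 1.x-style API through `tensorflow.compat.v1`, a toy loss `w**2` in place of a real model, and a hypothetical `train` helper that restores the newest checkpoint if one exists and otherwise initializes from scratch. A global step variable is saved with the weights so the step count also survives the pause.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf

# Assumption: TF 1.x-style graphs and sessions, reached via the compat module.
tf.disable_eager_execution()


def train(ckpt_dir, num_steps):
    """Run num_steps more steps, resuming from ckpt_dir when possible."""
    prefix = os.path.join(ckpt_dir, "model")  # checkpoint filename prefix
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()
        # Toy problem (hypothetical): minimize w**2 starting from w = 5.
        w = tf.get_variable(
            "w", shape=[], initializer=tf.constant_initializer(5.0))
        loss = tf.square(w)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
        saver = tf.train.Saver()

        with tf.Session() as sess:
            latest = tf.train.latest_checkpoint(ckpt_dir)
            if latest is None:
                # First run: nothing to resume from.
                sess.run(tf.global_variables_initializer())
            else:
                # Later runs: pick up weights and step count from disk.
                saver.restore(sess, latest)

            for _ in range(num_steps):
                sess.run(train_op)

            step = int(sess.run(global_step))
            saver.save(sess, prefix, global_step=step)
            return step, float(sess.run(w))
```

Calling `train(d, 5)` twice with the same directory `d` runs ten steps in total: the second call restores `w` and the global step written by the first and continues from there instead of starting over.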

Saurabh Kumar answered Nov 20 '22 12:11