This question was asked before the documentation for saving and restoring was available. I now consider it deprecated and advise people to rely on the official Save and Restore documentation.
Gist of old question:
I got TF working fine for the CIFAR tutorial. I've changed the code to save the train_dir (the directory with the checkpoint and models) to a known location. Which brings me to my question: how can I pause and resume training with TF?
TensorFlow uses graph-based computation: nodes are ops, edges are the tensors flowing between them, and state is held in Variables. It provides a Saver for those Variables.
Because the computation can be distributed, you can run part of a graph on one machine or processor and the rest on another. Meanwhile, you can save the state (the Variables) and feed it back in next time to continue your work.
saver.save(sess, 'my-model', global_step=0) ==> filename: 'my-model-0'
...
saver.save(sess, 'my-model', global_step=1000) ==> filename: 'my-model-1000'
Later you can call
saver.restore(sess, save_path)
(the tf.train.Saver.restore method) to restore your saved Variables.
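To make the naming convention above concrete, here is a minimal sketch of saving with global_step and then restoring the newest checkpoint. It assumes the TF 1.x graph API reached through tensorflow.compat.v1 (which is also importable on TF 2.x); the temporary directory, the "counter" variable, and the three-step loop are hypothetical and only there for illustration.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # TF 1.x-style API, assumed available

tf.disable_eager_execution()  # Session-based code needs graph mode on TF 2.x

ckpt_dir = tempfile.mkdtemp()                # hypothetical checkpoint directory
prefix = os.path.join(ckpt_dir, "my-model")  # filename prefix, as in the answer

with tf.Graph().as_default():
    # A single piece of state standing in for real model variables.
    counter = tf.get_variable("counter", shape=[],
                              initializer=tf.zeros_initializer())
    bump = tf.assign_add(counter, 1.0)
    saver = tf.train.Saver()

    # "Training": bump the counter and write one checkpoint per step.
    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        for step in range(1, 4):
            sess.run(bump)
            saver.save(sess, prefix, global_step=step)  # -> my-model-<step>

    # Later, in a fresh session: find the newest checkpoint and restore it.
    latest = tf.train.latest_checkpoint(ckpt_dir)
    with tf.Session() as sess:
        saver.restore(sess, latest)  # no initializer needed after restore
        restored_value = float(sess.run(counter))

print(latest, restored_value)
```

tf.train.latest_checkpoint reads the small "checkpoint" index file that the Saver maintains in the directory, so you do not have to track the step suffix yourself.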
Saver Usage
As described by Hamed, the right way to do it in TensorFlow is:
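saver = tf.train.Saver()  # TF 1.x API (tf.compat.v1.train.Saver in TF 2.x)
save_path = 'checkpoints/model'  # filename prefix, not a directory; 'model' is an arbitrary name
# While training, store the variables with:
saver.save(sess=session, save_path=save_path)
# and later restore them with:
saver.restore(sess=session, save_path=save_path)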
This loads the model as it was when you last saved it, and training can continue from there if you want.
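Putting this together for the original pause/resume question, one possible sketch is a training function that restores the newest checkpoint if one exists and otherwise initializes from scratch. It again assumes the TF 1.x API via tensorflow.compat.v1; the train function, the toy loss w squared, the learning rate, and the checkpoint directory are all hypothetical stand-ins for a real model such as the CIFAR one.

```python
import os
import tempfile

import tensorflow.compat.v1 as tf  # TF 1.x-style API, assumed available

tf.disable_eager_execution()  # Session-based code needs graph mode on TF 2.x


def train(ckpt_dir, num_steps):
    """Run num_steps more steps, resuming from the newest checkpoint if any."""
    prefix = os.path.join(ckpt_dir, "model")
    with tf.Graph().as_default():
        global_step = tf.train.get_or_create_global_step()
        # Toy model: minimize w**2 starting from w = 5.
        w = tf.get_variable("w", shape=[],
                            initializer=tf.constant_initializer(5.0))
        loss = tf.square(w)
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
        saver = tf.train.Saver()

        with tf.Session() as sess:
            latest = tf.train.latest_checkpoint(ckpt_dir)
            if latest:
                saver.restore(sess, latest)  # resume: weights and step counter
            else:
                sess.run(tf.global_variables_initializer())  # fresh start
            for _ in range(num_steps):
                sess.run(train_op)
            # "Pause": persist the state, tagged with the current step.
            saver.save(sess, prefix, global_step=global_step)
            step, weight = sess.run([global_step, w])
            return int(step), float(weight)


ckpt_dir = tempfile.mkdtemp()        # hypothetical checkpoint directory
first = train(ckpt_dir, 10)          # starts from scratch, stops at step 10
second = train(ckpt_dir, 5)          # resumes at step 10, stops at step 15
print(first, second)
```

Because the global step is itself a saved Variable, the second call continues counting from where the first one stopped rather than restarting at zero.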