 

TensorFlow: save the model with the smallest validation error

I ran a training job with TensorFlow and got the following curve for the loss on the validation set. The network starts to overfit after the 6000th iteration, so I'd like to get the model from before the overfitting starts.

[plot: validation loss vs. training iteration]

My training code is something like the following:

train_step = ......
summary = tf.scalar_summary(l1_loss.op.name, l1_loss)
summary_writer = tf.train.SummaryWriter("checkpoint", sess.graph)
saver = tf.train.Saver()
for i in xrange(20000):
    batch = get_next_batch(batch_size)
    sess.run(train_step, feed_dict={x: batch.x, y: batch.y})
    if (i + 1) % 100 == 0:
        # Every 100 iterations: save a checkpoint and log the validation loss.
        saver.save(sess, "checkpoint/net", global_step=i + 1)
        summary_str = sess.run(summary, feed_dict=validation_feed_dict)
        summary_writer.add_summary(summary_str, i + 1)
        summary_writer.flush()

After training finishes, only five checkpoints are saved (19600, 19700, 19800, 19900, 20000). Is there any way to make TensorFlow save checkpoints according to the validation error?

P.S. I know that tf.train.Saver has a max_to_keep argument, which in principle could save all the checkpoints. But that's not what I want (unless it's the only option). I want the saver to keep the checkpoint with the smallest validation loss seen so far. Is that possible?

asked Aug 31 '16 by Ying Xiong

People also ask

How do you save the best model in TensorFlow?

If save_best_only=True, the callback only saves when the model is considered the "best", and the latest best model according to the quantity monitored will not be overwritten. If filepath doesn't contain formatting options like {epoch}, then filepath will be overwritten by each new better model.
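A minimal sketch of that callback (the filepath pattern here is just an illustrative placeholder; Keras fills in {epoch} and {val_loss} whenever a better model is written):

from tensorflow.keras.callbacks import ModelCheckpoint

# Writes a new file only when the monitored val_loss improves.
checkpoint_cb = ModelCheckpoint(
    filepath="weights_epoch{epoch:02d}_val{val_loss:.4f}.h5",
    monitor="val_loss",
    save_best_only=True,
    verbose=1)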

How can validation loss be reduced?

Solutions to this are to decrease your network size or to increase dropout; for example, you could try a dropout of 0.5. If your training and validation losses are about equal, then your model is underfitting: increase the size of your model (either the number of layers or the raw number of neurons per layer).
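For illustration, a hedged sketch of adding dropout to a small Keras model (the layer sizes and input shape are arbitrary placeholders, not taken from the question):

from tensorflow.keras import layers, models

# Dropout(0.5) after each hidden layer randomly zeroes half the activations
# during training, which helps reduce overfitting.
model = models.Sequential([
    layers.Dense(128, activation="relu", input_shape=(784,)),
    layers.Dropout(0.5),
    layers.Dense(64, activation="relu"),
    layers.Dropout(0.5),
    layers.Dense(10, activation="softmax"),
])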

How do you save the best epoch?

If you want to save the best model during training, you have to use the ModelCheckpoint callback class. It has options to save the model weights at given times during the training and will allow you to keep the weights of the model at the end of the epoch specifically where the validation loss was at its minimum.

What is ModelCheckpoint?

The ModelCheckpoint callback class allows you to define where to checkpoint the model weights, how to name the file, and under what circumstances to make a checkpoint of the model. The API allows you to specify which metric to monitor, such as loss or accuracy on the training or validation dataset.
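Putting those pieces together, a possible end-to-end usage sketch (model, x_train, y_train, x_val, and y_val are assumed to be defined elsewhere; "best_model.h5" is a placeholder path):

from tensorflow.keras.callbacks import ModelCheckpoint
from tensorflow.keras.models import load_model

# Monitor the validation loss and keep only the best full model on disk.
checkpoint_cb = ModelCheckpoint("best_model.h5", monitor="val_loss",
                                mode="min", save_best_only=True)

model.fit(x_train, y_train,
          validation_data=(x_val, y_val),
          epochs=50,
          callbacks=[checkpoint_cb])

# Reload the model from the epoch with the lowest validation loss.
best_model = load_model("best_model.h5")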


1 Answer

You need to calculate the classification accuracy on the validation set and keep track of the best one seen so far, and only write a checkpoint once an improvement in validation accuracy has been found.

If the data-set and/or model is large, then you may have to split the validation-set into batches to fit the computation in memory.
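Adapted to the question's own training loop, a minimal sketch of this idea might look like the following (same old-style API as the question, tracking the validation loss rather than accuracy; best_loss and the path "checkpoint/best_net" are illustrative names):

saver = tf.train.Saver(max_to_keep=1)  # only one file kept: the best so far
best_loss = float("inf")
for i in xrange(20000):
    batch = get_next_batch(batch_size)
    sess.run(train_step, feed_dict={x: batch.x, y: batch.y})
    if (i + 1) % 100 == 0:
        # Evaluate on the validation set and save only on improvement.
        val_loss = sess.run(l1_loss, feed_dict=validation_feed_dict)
        if val_loss < best_loss:
            best_loss = val_loss
            saver.save(sess, "checkpoint/best_net", global_step=i + 1)

Because a checkpoint is written only when the validation loss improves, max_to_keep=1 is enough to end up with exactly the best model on disk.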

This tutorial shows exactly how to do what you want:

https://github.com/Hvass-Labs/TensorFlow-Tutorials/blob/master/04_Save_Restore.ipynb

It is also available as a short video:

https://www.youtube.com/watch?v=Lx8JUJROkh0

answered Dec 12 '22 by questiondude