What is the relationship between steps and epochs in TensorFlow?

Tags:

tensorflow

I am going through the TensorFlow getting started tutorial. In the tf.contrib.learn example, there are these two lines of code:

input_fn = tf.contrib.learn.io.numpy_input_fn({"x":x}, y, batch_size=4, num_epochs=1000)
estimator.fit(input_fn=input_fn, steps=1000)

I am wondering what the difference is between the steps argument in the call to fit() and the num_epochs argument in the numpy_input_fn call. Shouldn't there be just one argument? How are they connected?

I have found that the code seems to take the minimum of the two as the number of steps in the tutorial's toy example.

At least one of the two parameters, num_epochs or steps, has to be redundant, since we can calculate one from the other. Is there a way I can find out how many steps (the number of times the parameters get updated) my algorithm actually took?

I am curious about which one takes precedence. And does it depend on some other parameters?

303

asked Mar 15 '17 16:03

user1953366

1 Answer

TL;DR: An epoch is one pass of your model through your whole training data. A step is one training update on a single batch (or on a single sample if you send samples one by one). Training for 5 epochs on 1000 samples with 10 samples per batch takes 500 steps (100 steps per epoch).
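The arithmetic above can be written out as a small helper. This is only a sketch: total_steps is a made-up name, not a TensorFlow function, and it assumes each epoch is batched on its own, so a final partial batch still counts as one step.

```python
import math

def total_steps(num_samples, batch_size, num_epochs):
    """Number of batches (steps) needed for num_epochs passes over the data."""
    # Assumption: a trailing partial batch in an epoch counts as a full step.
    steps_per_epoch = math.ceil(num_samples / batch_size)
    return steps_per_epoch * num_epochs

print(total_steps(1000, 10, 5))  # 500, the TL;DR example
```

With the tutorial's numbers (4 samples, batch_size=4, num_epochs=1000) the same formula gives 1000 steps.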

The contrib.learn.io module is not documented very well, but it seems that the numpy_input_fn() function takes some numpy arrays and batches them together as input for an estimator. So the number of epochs probably means "how many times to go through the input data I have before stopping". In this case, they feed two arrays of length 4 in 4-element batches, so each batch is a full epoch, and the input function will produce at most 1000 batches before raising an "out of data" exception. The steps argument in the estimator's fit() function is how many times the estimator should run the training loop. This particular example is somewhat perverse, so let me make up another one to make things a bit clearer (hopefully).

Let's say you have two numpy arrays (samples and labels) that you want to train on. They have 100 elements each. You want your training to take batches of 10 samples. After 10 batches you will have gone through all of your training data. That is one epoch. If you set your input generator to 10 epochs, it will go through your training set 10 times before stopping; that is, it will generate at most 100 batches.
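To make the counting concrete, here is a minimal sketch in plain Python. The batches generator is hypothetical and only mimics the behaviour described above; it is not how numpy_input_fn is implemented, and the label values are made up.

```python
def batches(samples, labels, batch_size, num_epochs):
    """Yield (sample_batch, label_batch) pairs for num_epochs passes over the data."""
    for _ in range(num_epochs):
        for start in range(0, len(samples), batch_size):
            end = start + batch_size
            yield samples[start:end], labels[start:end]

samples = list(range(100))
labels = [2 * s for s in samples]  # made-up labels, for illustration only

num_batches = sum(1 for _ in batches(samples, labels, batch_size=10, num_epochs=10))
print(num_batches)  # 100 batches: 10 per epoch, 10 epochs
```

Once the generator is exhausted there is nothing left to train on, which corresponds to the "out of data" condition.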

Again, the io module is not documented, but considering how other input-related APIs in TensorFlow work, it should be possible to make it generate data for an unlimited number of epochs, so that the only thing controlling the length of training is the steps argument. This gives you some extra flexibility in how you want your training to progress. You can go a number of epochs at a time, a number of steps at a time, or both.
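The interaction between the two limits can be sketched as a loop that stops at whichever comes first: the step budget or the end of the input. This is an illustration of the "minimum of the two" behaviour the question describes, under the assumption that the estimator simply stops when the input runs out; run_training is a hypothetical name, not a TensorFlow API.

```python
import itertools

def run_training(batch_iter, steps=None):
    """Consume batches until `steps` are done or the input runs out; return steps taken."""
    taken = 0
    # islice stops after `steps` items, or at the end of the input if steps is None.
    for _batch in itertools.islice(batch_iter, steps):
        # A real training loop would update the model parameters here.
        taken += 1
    return taken

finite_input = iter(["batch"] * 100)        # e.g. 10 epochs of 10 batches each
endless_input = itertools.repeat("batch")   # unlimited number of epochs

print(run_training(finite_input, steps=1000))   # 100: the input ran out first
print(run_training(endless_input, steps=1000))  # 1000: the step limit was hit first
```

With an endless input, steps must be set, otherwise the loop never terminates.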

125

answered Sep 23 '22 01:09

Mad Wombat
