 

Choosing number of Steps per Epoch

If I want to train a model with train_generator, is there a significant difference between choosing

  • 10 Epochs with 500 Steps each

and

  • 100 Epochs with 50 Steps each

Currently I am training for 10 epochs, because each epoch takes a long time, but any graph showing improvement looks very "jumpy" because I only have 10 data points. I figure I can get a smoother graph if I use 100 epochs, but I want to know first whether there is any downside to this.
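For reference, here is a rough sketch of the two setups with a toy model and generator (the model, data, and names are just placeholders, not my actual setup):

import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

def train_generator(batch_size=32):
    # Infinite generator of random batches (stand-in for real data)
    while True:
        x = np.random.rand(batch_size, 8).astype("float32")
        y = np.random.rand(batch_size, 1).astype("float32")
        yield x, y

# Option A: 10 epochs x 500 steps = 5,000 weight updates, 10 history points
model.fit(train_generator(), steps_per_epoch=500, epochs=10, verbose=0)

# Option B: 100 epochs x 50 steps = the same 5,000 updates, but 100 history
# points, so the loss curve has more data points and looks smoother
model.fit(train_generator(), steps_per_epoch=50, epochs=100, verbose=0)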

asked Apr 19 '18 by n.st

People also ask

How do you choose steps per epoch?

An epoch consists of one full cycle through the training data. This is usually many steps. As an example, if you have 2,000 images and use a batch size of 10 an epoch consists of 2,000 images / (10 images / step) = 200 steps.

How do you choose optimal number of epochs?

The right number of epochs depends on the inherent perplexity (or complexity) of your dataset. A good rule of thumb is to start with a value that is 3 times the number of columns in your data. If you find that the model is still improving after all epochs complete, try again with a higher value.

What should be the steps per epoch in Keras?

At each step, the network takes in one batch of samples and updates its weights based on the mean loss over that batch, so the weights are updated once per step. The steps per epoch simply indicate how many batches of the dataset are fed to the network in each epoch.

How do you choose optimal batch size and epochs?

Generally, a batch size of 32 or 25 works well, with around 100 epochs, unless you have a large dataset. In the case of a large dataset you can go with a batch size of 10 and between 50 and 100 epochs. These figures have worked fine in practice, and the batch size should preferably be a power of 2.


4 Answers

Based on what you said, it sounds like you need a larger batch_size, and of course there are implications of that which could affect the steps_per_epoch and number of epochs.

To reduce the jumping around

  • A larger batch size will give you a less noisy gradient estimate and will help to prevent the jumping around
  • You may also want to consider a smaller learning rate, or a learning rate scheduler (or decay) to allow the network to "settle in" as it trains (a short sketch follows this list)
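For example, a rough sketch of a smaller learning rate plus exponential decay; the tiny model and the specific numbers here are illustrative, not tuned values:

import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])

# The learning rate shrinks by ~4% every 1,000 steps, letting the network
# "settle in" instead of jumping around the minimum
schedule = tf.keras.optimizers.schedules.ExponentialDecay(
    initial_learning_rate=1e-4,   # smaller than the Keras Adam default of 1e-3
    decay_steps=1000,
    decay_rate=0.96,
)

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=schedule),
    loss="sparse_categorical_crossentropy",
    metrics=["accuracy"],
)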

Implications of a larger batch-size

  • Too large of a batch_size can produce memory problems, especially if you are using a GPU. Once you exceed the limit, dial it back until it works. This will help you find the maximum batch size that your system can work with (a rough probe sketch follows this list).
  • Too large of a batch size can also get you stuck in a local minimum, so if your training gets stuck, I would reduce it somewhat. Imagine that here you are over-correcting the jumping around, and the training is no longer jumping around enough to further minimize the loss function.
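A rough way to automate the "dial it back" search might look like this; the helper name, candidate sizes, and stand-in model are made up for illustration, and real OOM recovery can be messier than a simple retry:

import numpy as np
import tensorflow as tf

def find_max_batch_size(build_model, x, y, candidates=(512, 256, 128, 64, 32, 16)):
    # Try batch sizes from large to small; keep the first one that fits in memory
    for bs in candidates:
        try:
            model = build_model()
            model.fit(x[:bs], y[:bs], batch_size=bs, epochs=1, verbose=0)
            return bs
        except tf.errors.ResourceExhaustedError:
            continue  # out of GPU memory at this size, dial it back
    return None

# Example usage with stand-in data and a tiny compiled model
x = np.random.rand(1024, 8).astype("float32")
y = np.random.rand(1024, 1).astype("float32")

def build_model():
    m = tf.keras.Sequential([tf.keras.Input(shape=(8,)), tf.keras.layers.Dense(1)])
    m.compile(optimizer="adam", loss="mse")
    return m

max_bs = find_max_batch_size(build_model, x, y)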

When to reduce epochs

  • If your training error is very low, yet your test/validation error is very high, then you have over-fit the model with too many epochs.
  • The best way to find the right balance is to use early stopping with a validation set. Here you can specify when to stop training, and save the weights for the network that gives you the best validation loss. (I highly recommend always using this; a minimal example follows this list.)
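A minimal early-stopping setup might look like this; the random stand-in data, tiny model, patience value, and file name are just for illustration:

import numpy as np
import tensorflow as tf

x = np.random.rand(1000, 8).astype("float32")
y = np.random.randint(0, 2, size=(1000, 1)).astype("float32")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(8,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")

callbacks = [
    # Stop once val_loss hasn't improved for 5 epochs and keep the best weights
    tf.keras.callbacks.EarlyStopping(monitor="val_loss", patience=5,
                                     restore_best_weights=True),
    # Also save the best model to disk as training runs
    tf.keras.callbacks.ModelCheckpoint("best_model.h5", monitor="val_loss",
                                       save_best_only=True),
]

# epochs is set high on purpose; early stopping decides when to actually stop
model.fit(x, y, validation_split=0.2, epochs=100, callbacks=callbacks, verbose=0)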

When to adjust steps-per-epoch

  • Traditionally, the steps per epoch is calculated as train_length // batch_size, since this will use all of the data points, one batch size worth at a time.
  • If you are augmenting the data, then you can stretch this a tad (sometimes I multiply that value by 2 or 3, etc.), as in the short sketch after this list. But if it's already training for too long, I would just stick with the traditional approach.
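For example, with purely illustrative numbers:

train_length = 10_000   # number of training samples
batch_size = 32

steps_per_epoch = train_length // batch_size        # 312: each sample seen about once per epoch
augmented_steps_per_epoch = 2 * steps_per_epoch     # 624: stretched when augmenting the data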
answered by Chris Farr


Steps per epoch is not something you should tie to the number of epochs.

Naturally, what you want is for your generator to pass through all of your training data exactly once per epoch. To achieve this, you should set steps per epoch equal to the number of batches, like this:

steps_per_epoch = int( np.ceil(x_train.shape[0] / batch_size) )

As you can see from the equation above, the larger the batch_size, the lower the steps_per_epoch.
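For example, with a hypothetical 2,000-sample training set:

import numpy as np

m = 2000                                   # number of training samples
for batch_size in (10, 50):
    steps_per_epoch = int(np.ceil(m / batch_size))
    print(batch_size, steps_per_epoch)     # 10 -> 200 steps, 50 -> 40 steps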

Next, you choose the number of epochs based on validation performance (choose whatever you think works best).

answered by Ioannis Nasios


The steps per epoch denote the number of batches selected for one epoch. If 500 steps are selected, then the network will train on 500 batches to complete one epoch. If we select a large number of epochs, it can be computationally expensive.

answered by Manish Vasandnani


steps_per_epoch tells the network how many batches to include in an epoch.

By definition, an epoch is considered complete when the dataset has been run through the model once in its entirety. In other words, it means that all training samples have been run through the model. (For the discussion below, let us assume that the number of training examples is m.)

Also by definition, we know that batch_size lies in the range [1, m].

Below is what TensorFlow page says about steps_per_epoch

If you want to run training only on a specific number of batches from this Dataset, you can pass the steps_per_epoch argument, which specifies how many training steps the model should run using this Dataset before moving on to the next epoch.

Now suppose that your training_size, m = 128 and batch_size, b = 16, which means that your data is grouped into 8 batches. According to the above quote, the maximum value you can assign to steps_per_epoch is 8, as computed in one of the answers by @Ioannis Nasios.

However, it is not necessary that you set the value to 8 only (as in our example). You can choose any value between 1 and 8. You just need to be aware that the training will be performed only with this number of batches.

The reason for the jumpy error values could be the size of your batch, as correctly mentioned in this answer by @Chris Farr.

Training & evaluation from tf.data Datasets

If you do this, the dataset is not reset at the end of each epoch, instead we just keep drawing the next batches. The dataset will eventually run out of data (unless it is an infinitely-looping dataset).

The advantage of a low value for steps_per_epoch is that different epochs are trained with different subsets of the data (a kind of regularization). However, if you have a limited training set, using only a subset of the batches may not be what you want. It is a decision one has to make.
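To make the m = 128, b = 16 example above concrete, here is a small sketch; the tiny model and random data are stand-ins:

import numpy as np
import tensorflow as tf

x = np.random.rand(128, 4).astype("float32")
y = np.random.rand(128, 1).astype("float32")

# 128 samples / batch size 16 = 8 batches per full pass over the data
dataset = tf.data.Dataset.from_tensor_slices((x, y)).batch(16).repeat()

model = tf.keras.Sequential([
    tf.keras.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# steps_per_epoch=8 uses every batch once per epoch; a smaller value (say 4)
# trains each epoch on only half the batches, and because of .repeat() the
# next epoch simply keeps drawing where the previous one left off
model.fit(dataset, steps_per_epoch=8, epochs=5, verbose=0)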

answered by Harsha Y