 

Neural Network : Epoch and Batch Size

I am trying to train a neural network to classify words into different categories. I notice two things:

When I use a smaller batch_size (like 8, 16, 32), the loss does not decrease steadily but instead varies sporadically. When I use a larger batch_size (like 128, 256), the loss goes down, but very slowly.

More importantly, when I use a larger EPOCH value, my model does a good job of reducing the loss. However, I'm using a really large value (EPOCHS = 10000).

Question: How do I choose good EPOCH and batch_size values?

asked Sep 17 '25 by newbieprogrammer


1 Answer

There is no fixed rule for choosing these values. Unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.

When you train a network, you compute a gradient that reduces the loss, and to obtain it you backpropagate the loss. Ideally, you would compute the loss over all of the samples in your data, because then the resulting gradient accounts for every sample. In practice this is usually too expensive: for every single update you would have to run a forward pass over the entire dataset. That case corresponds to batch_size = N, where N is the total number of data points you have.

Therefore, we use a small batch_size as an approximation: instead of considering all the samples, we compute the gradient based on a small subset, at the cost of losing some information about the true gradient.
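To make that concrete, here is a minimal NumPy sketch (the toy linear-regression data and the batch size of 32 are my own assumptions, not from the question). The mini-batch gradient is a cheap but noisy estimate of the full-batch gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 5
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)  # current parameters

def gradient(X_b, y_b, w):
    """Gradient of the mean squared error over the given batch."""
    residual = X_b @ w - y_b
    return 2 * X_b.T @ residual / len(y_b)

full_grad = gradient(X, y, w)                 # batch_size = N: exact but expensive
idx = rng.choice(N, size=32, replace=False)   # batch_size = 32: cheap, noisy estimate
mini_grad = gradient(X[idx], y[idx], w)

# The difference is the information lost by using a mini-batch.
print(np.linalg.norm(full_grad - mini_grad))
```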

Rule of thumb: smaller batch sizes give noisy gradients but converge faster, because you perform more updates per epoch. If your batch size is 1, you will have N updates per epoch; if it is N, you will have only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly.
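A quick sketch of that arithmetic (the dataset size below is made up purely for illustration):

```python
import math

N = 50_000  # assumed dataset size, just for illustration
for batch_size in (1, 8, 32, 128, N):
    updates_per_epoch = math.ceil(N / batch_size)
    print(f"batch_size = {batch_size:>6}: {updates_per_epoch} parameter updates per epoch")
```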

That is why you observe a varying loss with smaller batch sizes: the gradient is noisy. With larger batch sizes, the gradient is informative, but you need many epochs because you update less frequently.

The ideal batch size is one that gives you informative gradients while remaining small enough to train the network efficiently. In practice, you can only find it by experimenting.
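If it helps, here is one way to run that experiment. This is only a hedged sketch assuming a Keras/TensorFlow setup; the random data, layer sizes, and candidate batch sizes are placeholders for your own word-classification model. The key idea is to give a generous epoch budget and let early stopping end training, so you do not have to guess a value like EPOCHS = 10000 up front:

```python
import numpy as np
import tensorflow as tf

# Placeholder data standing in for your word-classification features/labels.
rng = np.random.default_rng(0)
x_train = rng.normal(size=(2000, 50)).astype("float32")
y_train = rng.integers(0, 10, size=2000)

def build_model():
    return tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])

# Stop once the validation loss stops improving instead of fixing EPOCHS by hand.
early_stop = tf.keras.callbacks.EarlyStopping(
    monitor="val_loss", patience=10, restore_best_weights=True
)

for batch_size in (32, 64, 128, 256):
    model = build_model()
    model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
    history = model.fit(
        x_train, y_train,
        validation_split=0.2,
        batch_size=batch_size,
        epochs=1000,              # generous upper bound; early stopping ends it sooner
        callbacks=[early_stop],
        verbose=0,
    )
    print(batch_size, min(history.history["val_loss"]))
```

Comparing the best validation loss across batch sizes gives you a data-driven choice rather than a guess, and the number of epochs actually run tells you how long each setting needs.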

answered Sep 21 '25 by Berkay Berabi