I am trying to train a neural network to classify words into different categories. I notice two things:
When I use a smaller batch_size (like 8, 16, 32), the loss is not decreasing steadily but rather varies sporadically. When I use a larger batch_size (like 128, 256), the loss is going down, but very slowly.
More importantly, when I use a larger EPOCH value, my model does a good job at reducing the loss. However, I'm using a really large value (EPOCHS = 10000).
Question: How do I find good values for EPOCHS and batch_size?
There is no fixed rule for choosing these values; unfortunately, the best choices depend on the problem and the task. However, I can give you some insights.
When you train a network, you calculate a gradient that would reduce the loss, which requires backpropagating the loss. Ideally, you would compute the loss over all of the samples in your data, because the resulting gradient takes every sample into account. In practice this is not feasible due to computational cost: every single update would require a forward pass over the entire dataset. That case corresponds to batch_size = N, where N is the total number of data points you have.
Therefore, we use a small batch_size as an approximation. Instead of considering all the samples, we compute the gradient on a small subset of them, at the cost of losing some information about the true gradient.
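To make this concrete, here is a minimal NumPy sketch (the toy regression data, the zero-initialized weights, and the mini-batch size of 32 are assumptions for illustration, not from the question). It compares the exact gradient computed over all N samples with an estimate from a small random mini-batch:

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 20                      # toy dataset: N samples, D features
X = rng.normal(size=(N, D))
true_w = rng.normal(size=D)
y = X @ true_w + 0.1 * rng.normal(size=N)

w = np.zeros(D)                        # current parameters

def gradient(Xb, yb, w):
    """Gradient of the mean-squared-error loss over the given batch."""
    residual = Xb @ w - yb
    return 2.0 * Xb.T @ residual / len(yb)

full_grad = gradient(X, y, w)          # batch_size = N: exact but expensive

idx = rng.choice(N, size=32, replace=False)
mini_grad = gradient(X[idx], y[idx], w)   # batch_size = 32: cheap but noisy

# The mini-batch gradient points roughly in the same direction as the full
# gradient, but with random error that shrinks as the batch size grows.
cos = full_grad @ mini_grad / (np.linalg.norm(full_grad) * np.linalg.norm(mini_grad))
print(f"cosine similarity between full and mini-batch gradient: {cos:.3f}")
```

If you rerun this with larger mini-batches, the estimate lines up more and more closely with the full gradient, which is exactly the trade-off described above.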
Rule of thumb: smaller batch sizes give noisy gradients, but they converge faster because you get more updates per epoch. If your batch size is 1, you get N updates per epoch; if it is N, you get only 1 update per epoch. On the other hand, larger batch sizes give a more informative gradient, but they converge more slowly.
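As a quick illustration of that update count (the dataset size N = 50,000 here is just an assumed example):

```python
import math

N = 50_000                        # assumed number of training samples
for batch_size in (1, 8, 32, 128, 256, N):
    updates_per_epoch = math.ceil(N / batch_size)
    print(f"batch_size={batch_size:>6} -> {updates_per_epoch:>6} updates per epoch")
```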
That is why you observe fluctuating losses with smaller batch sizes: the gradient is noisy. And with larger batch sizes, the gradient is informative, but you need a lot of epochs because you update less frequently.
The ideal batch size is one that gives you informative gradients while staying small enough that you can train the network efficiently. In practice, you can only find it by trying.
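One practical way to do that trying, sketched below, is to train the same model with a few candidate batch sizes under the same budget and compare validation loss. This assumes a Keras-style setup; the random stand-in data, the tiny model, and the candidate batch sizes are placeholders for illustration, and early stopping is used so the epochs value is only an upper bound rather than something you have to tune by hand:

```python
import numpy as np
import tensorflow as tf

# Stand-in data for illustration: replace with your word features and labels.
num_samples, num_features, num_classes = 5_000, 100, 10
x = np.random.rand(num_samples, num_features).astype("float32")
y = np.random.randint(0, num_classes, size=num_samples)

def build_model():
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(num_features,)),
        tf.keras.layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

results = {}
for batch_size in (16, 32, 64, 128, 256):      # candidate values to try
    model = build_model()
    # Early stopping halts training once validation loss stops improving,
    # so the epochs cap is only an upper bound, not the actual run length.
    early_stop = tf.keras.callbacks.EarlyStopping(
        monitor="val_loss", patience=5, restore_best_weights=True)
    history = model.fit(x, y,
                        batch_size=batch_size,
                        epochs=500,
                        validation_split=0.2,
                        callbacks=[early_stop],
                        verbose=0)
    results[batch_size] = min(history.history["val_loss"])

best = min(results, key=results.get)
print("validation loss per batch size:", results)
print("best batch size under this budget:", best)
```

The batch size that reaches the lowest validation loss within your time budget is the one worth keeping; the number of epochs then falls out of early stopping instead of being guessed up front.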