I am trying to train a very large model, so I can only fit a very small batch size into GPU memory. Working with such small batch sizes results in very noisy gradient estimates.
What can I do to avoid this problem?
Minibatch gradient descent. Smaller batch sizes are used for two main reasons: the noisier gradient estimates have a regularizing effect and can lower generalization error, and a smaller batch is easier to fit into memory (e.g. when training on a GPU).
I have been experimenting with different values and observed that lower batch sizes can also lead to overfitting: in my runs the validation loss started to increase after about 10 epochs, indicating the model was starting to overfit.
In practical terms, to determine the optimal batch size, we recommend trying smaller batch sizes first (usually 32 or 64), keeping in mind that small batch sizes also call for smaller learning rates. The batch size should be a power of 2 to take full advantage of the GPU's processing.
Comparing small vs. large batch sizes on neural network training: judging by the validation metrics, the models trained with small batch sizes generalized better on the validation set, and a batch size of 32 gave us the best result.
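To make that advice concrete, here is a hedged sketch of such a batch-size sweep in PyTorch. The tiny model, the random tensors, and the exact learning-rate scaling are placeholders of my own; only the pattern (powers of 2, smaller learning rates for smaller batches, comparing validation loss) comes from the answer above.

```python
# Sketch: sweep power-of-2 batch sizes and compare validation loss.
# Model and data are synthetic placeholders, not from the original answer.
import torch
from torch.utils.data import DataLoader, TensorDataset

train_set = TensorDataset(torch.randn(1024, 20), torch.randint(0, 2, (1024,)))
val_set = TensorDataset(torch.randn(256, 20), torch.randint(0, 2, (256,)))

results = {}
for batch_size in [32, 64, 128, 256]:              # powers of 2, smallest first
    lr = 0.01 * (batch_size / 32)                  # smaller batch -> smaller lr
    model = torch.nn.Sequential(torch.nn.Linear(20, 16), torch.nn.ReLU(),
                                torch.nn.Linear(16, 2))
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()

    for epoch in range(10):
        for x, y in DataLoader(train_set, batch_size=batch_size, shuffle=True):
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()

    with torch.no_grad():                          # validation loss per batch size
        x_val, y_val = val_set.tensors
        results[batch_size] = loss_fn(model(x_val), y_val).item()

print(results)   # pick the batch size with the lowest validation loss
```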
You can change the iter_size in the solver parameters. Caffe accumulates gradients over iter_size x batch_size instances in each stochastic gradient descent step, so increasing iter_size also gives you a more stable gradient when you cannot use a large batch_size due to limited memory.
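If you are not using Caffe, the same gradient-accumulation idea is easy to reproduce by hand. Below is a minimal sketch in PyTorch (not from the original answer); the model, loss, and data loader are placeholders, and accumulation_steps plays the role of Caffe's iter_size.

```python
# Sketch of gradient accumulation in PyTorch: update weights only every
# accumulation_steps micro-batches, so the effective batch size is
# batch_size * accumulation_steps. Model and data are placeholders.
import torch

accumulation_steps = 8                       # analogous to Caffe's iter_size

model = torch.nn.Linear(128, 10)             # placeholder model
loss_fn = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Placeholder data loader: 32 micro-batches of 4 samples each.
data_loader = [(torch.randn(4, 128), torch.randint(0, 10, (4,)))
               for _ in range(32)]

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(data_loader):
    loss = loss_fn(model(inputs), targets)
    # Scale the loss so the accumulated gradient matches the average
    # over the full effective batch.
    (loss / accumulation_steps).backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()                     # update with accumulated gradient
        optimizer.zero_grad()
```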