
How should the learning rate change as the batch size changes? [closed]

When I increase or decrease the mini-batch size used in SGD, should I change the learning rate? If so, how?

For reference, I was discussing this with someone, who said that when the batch size is increased, the learning rate should be decreased by some amount.

My understanding is that when I increase the batch size, the computed average gradient will be less noisy, so I would either keep the same learning rate or increase it.

Also, if I use an adaptive learning rate optimizer, like Adam or RMSProp, then I guess I can leave the learning rate untouched.

Please correct me if I am mistaken and give any insight on this.

asked Oct 28 '18 by Tanmay




2 Answers

Theory suggests that when multiplying the batch size by k, one should multiply the learning rate by sqrt(k) to keep the variance of the gradient estimate constant. See page 5 of A. Krizhevsky, One weird trick for parallelizing convolutional neural networks: https://arxiv.org/abs/1404.5997
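As a minimal sketch of that square-root scaling rule (the base learning rate and batch sizes below are illustrative values, not taken from the paper):

```python
import math

def sqrt_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate by sqrt(k) when the batch size is scaled by k
    (square-root scaling, as suggested in Krizhevsky, 2014)."""
    k = new_batch_size / base_batch_size
    return base_lr * math.sqrt(k)

# Example: going from batch size 256 at lr 0.1 to batch size 1024 (k = 4)
print(sqrt_scaled_lr(0.1, 256, 1024))  # 0.1 * sqrt(4) = 0.2
```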

However, recent experiments with large mini-batches suggest a simpler linear scaling rule, i.e. multiply your learning rate by k when using a mini-batch size of kN. See P. Goyal et al., Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour: https://arxiv.org/abs/1706.02677
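A corresponding sketch of the linear scaling rule (again with illustrative numbers; Goyal et al. also pair the scaled rate with a gradual warmup over the first few epochs, which is not shown here):

```python
def linear_scaled_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate linearly with the batch size
    (linear scaling rule, Goyal et al., 2017)."""
    k = new_batch_size / base_batch_size
    return base_lr * k

# Example: going from batch size 256 at lr 0.1 to batch size 8192 (k = 32)
print(linear_scaled_lr(0.1, 256, 8192))  # 0.1 * 32 = 3.2
```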

I would say that with Adam, Adagrad, and other adaptive optimizers, the learning rate can remain the same if the batch size does not change substantially.

answered Oct 03 '22 by Dmytro Prylipko


Apart from the papers mentioned in Dmytro's answer, you can refer to: Jastrzębski, S., Kenton, Z., Arpit, D., Ballas, N., Fischer, A., Bengio, Y., & Storkey, A. (2018). Width of Minima Reached by Stochastic Gradient Descent is Influenced by Learning Rate to Batch Size Ratio. The authors give a mathematical and empirical foundation for the idea that the ratio of learning rate to batch size influences the generalization capacity of a DNN. They show that this ratio plays a major role in the width of the minima found by SGD: the higher the ratio, the wider the minima and the better the generalization.

answered Oct 03 '22 by SvGA