I thought that batch size was only about performance: the bigger the batch, the more images are computed at the same time to train my net. But I noticed that if I change my batch size, my net's accuracy changes as well. So I clearly don't understand what batch size really is. Can someone explain to me what batch size is?
Caffe is trained using Stochastic Gradient Descent (SGD): that is, at each iteration it computes a (stochastic) gradient of the loss w.r.t. the parameters over the training data and makes a move (= changes the parameters) along the negative gradient direction.
Now, if you write out the equation of this gradient over the training data, you'll notice that computing the gradient exactly requires evaluating all of your training data at each iteration: this is prohibitively time consuming, especially as the training set gets bigger and bigger.
In order to overcome this, SGD approximates the exact gradient, in a stochastic manner, by sampling only a small portion of the training data at each iteration. This small portion is the batch.
Thus, the larger the batch size, the more accurate the gradient estimate is at each iteration.
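To make this concrete, here is a minimal NumPy sketch (not Caffe code): it compares the exact gradient, computed over all the data, with mini-batch estimates of increasing size. The linear-regression loss and the data are just illustrative assumptions, not anything from Caffe itself.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 10_000, 5                      # training-set size, number of parameters
X = rng.normal(size=(N, D))
y = X @ rng.normal(size=D) + 0.1 * rng.normal(size=N)
w = np.zeros(D)                       # current parameters

def gradient(indices):
    """Gradient of the mean squared error over the given subset of the data."""
    Xb, yb = X[indices], y[indices]
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(indices)

full_grad = gradient(np.arange(N))    # exact gradient: needs all N samples

for batch_size in (8, 64, 512):
    batch = rng.choice(N, size=batch_size, replace=False)
    est = gradient(batch)             # stochastic estimate from one mini-batch
    print(batch_size, np.linalg.norm(est - full_grad))
```

Running it, the error of the mini-batch estimate typically shrinks as the batch size grows, which is exactly the trade-off SGD exploits: cheaper, noisier gradients instead of one exact, expensive one.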
TL;DR: the batch size affects the accuracy of the estimated gradient at each iteration; changing the batch size therefore affects the "path" the optimization takes and may change the result of the training process.
Update:
At the ICLR 2018 conference an interesting work was presented:
Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le, "Don't Decay the Learning Rate, Increase the Batch Size".
This work relates the effects of changing the batch size and changing the learning rate: increasing the batch size during training can play a role similar to decaying the learning rate.
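As a rough sketch of that idea (hypothetical numbers and training loop, not the authors' code): wherever a step schedule would divide the learning rate by a factor, keep the learning rate fixed and multiply the batch size by that factor instead.

```python
def classic_schedule(epoch, lr=0.1, batch=256):
    if epoch >= 30: lr /= 10          # classic: decay the learning rate
    if epoch >= 60: lr /= 10
    return lr, batch

def batch_schedule(epoch, lr=0.1, batch=256):
    if epoch >= 30: batch *= 10       # paper's idea: grow the batch size instead
    if epoch >= 60: batch *= 10
    return lr, batch

for epoch in (0, 30, 60):
    print(epoch, classic_schedule(epoch), batch_schedule(epoch))
```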