In his blog post A brief overview of Deep Learning, Ilya Sutskever describes how important it is to choose the right minibatch size to train a deep neural network efficiently. His advice is to "use the smaller minibatch that runs efficiently on your machine". See the full quote below.
I've seen similar statements by other well-known deep learning researchers, but it is still unclear to me how to find the correct minibatch size. Since a larger minibatch can allow for a larger learning rate, it seems to take a lot of experiments to determine whether a given minibatch size yields better training speed.
I have a GPU with 4 GB of RAM and use the libraries Caffe and Keras. In this case, what is a practical heuristic for choosing a good minibatch size, given that each observation has a certain memory footprint M?
Minibatches: Use minibatches. Modern computers cannot be efficient if you process one training case at a time. It is vastly more efficient to train the network on minibatches of 128 examples, because doing so will result in massively greater throughput. It would actually be nice to use minibatches of size 1, and they would probably result in improved performance and lower overfitting; but the benefit of doing so is outweighed the massive computational gains provided by minibatches. But don’t use very large minibatches because they tend to work less well and overfit more. So the practical recommendation is: use the smaller minibatch that runs efficiently on your machine.
When we train a network, the forward pass has to keep all the intermediate activation outputs for the backward pass. You simply need to compute how much memory it costs to store all the relevant activation outputs of the forward pass, in addition to the other memory requirements (storing your weights on the GPU, etc.). The activation memory grows linearly with the batch size, so if your net is quite deep, you might want a smaller batch size, as you may not have enough memory otherwise.
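As a rough illustration of that calculation, here is a minimal Python sketch. It is not tied to Caffe or Keras, and the function name, its parameters, and the multipliers are all assumptions made for the example: values are taken to be 32-bit floats, each activation is assumed to be stored twice (the output plus its gradient), and each weight three times (weight, gradient, and one optimizer buffer such as momentum). Real frameworks add workspace and allocator overhead, so treat the result as an upper bound to test, not a guarantee.

```python
def estimate_max_batch_size(activations_per_example, n_params, budget_bytes,
                            bytes_per_value=4, activation_copies=2,
                            param_copies=3):
    """Rough upper bound on the batch size that fits in budget_bytes.

    All multipliers are illustrative assumptions, not framework facts:
      bytes_per_value=4    -> float32
      activation_copies=2  -> each activation plus its gradient
      param_copies=3       -> weight, gradient, one optimizer buffer
    """
    # Memory that does not depend on the batch size (weights etc.).
    fixed = n_params * bytes_per_value * param_copies
    # Memory that grows linearly with the batch size (activations).
    per_example = activations_per_example * bytes_per_value * activation_copies
    available = budget_bytes - fixed
    if available <= 0:
        return 0  # the weights alone do not fit in the budget
    return available // per_example


# Hypothetical network: 10M parameters, 2M activation values per example,
# on a 4 GB card.
max_batch = estimate_max_batch_size(
    activations_per_example=2_000_000,
    n_params=10_000_000,
    budget_bytes=4 * 1024**3,
)
print(max_batch)  # 260
```

In practice you would leave some headroom below this number and confirm by actually running one training step at that size.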
Selecting a minibatch size is a trade-off between memory constraints and performance/accuracy (usually evaluated using cross validation).
I personally guesstimate or compute by hand how much GPU memory my forward/backward pass will use, and then try out a few values. If, for example, the largest I can fit is roughly 128, I may cross validate using 32, 64, 96, etc. just to be thorough and see if I can get better performance. This is usually for a deeper net that is going to push my GPU memory (I also only have a 4 GB card and don't have access to the monster NVIDIA cards).
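That search can be sketched in a few lines of Python. The helper names below are hypothetical, and `evaluate` stands in for whatever routine trains the network at a given batch size and returns a validation score (higher is better); the scores in the usage example are made up purely to show the mechanics.

```python
def candidate_batch_sizes(max_fit, step=32):
    """Batch sizes from step up to max_fit (inclusive), in multiples of step."""
    return list(range(step, max_fit + 1, step))


def select_batch_size(candidates, evaluate):
    """Return the candidate with the highest validation score.

    evaluate is a user-supplied callable: batch_size -> validation score.
    """
    return max(candidates, key=evaluate)


# Largest size that fits in memory is assumed to be 128.
candidates = candidate_batch_sizes(128)
print(candidates)  # [32, 64, 96, 128]

# Placeholder scores standing in for real cross-validation results.
fake_scores = {32: 0.91, 64: 0.93, 96: 0.92, 128: 0.90}
best = select_batch_size(candidates, fake_scores.get)
print(best)  # 64
```

Since a different batch size may call for a different learning rate, a fair comparison would retune the learning rate for each candidate inside `evaluate`.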
I think there tends to be a greater emphasis on network architecture, optimization techniques/tricks of the trade, and data pre-processing than on the minibatch size itself.