
How to calculate optimal batch size

Sometimes I run into a problem:

OOM when allocating tensor with shape

e.g.

OOM when allocating tensor with shape (1024, 100, 160)

Here 1024 is my batch size, and I don't know what the other dimensions are. If I reduce the batch size or the number of neurons in the model, it runs fine.

Is there a generic way to calculate optimal batch size based on model and GPU memory, so the program doesn't crash?

In short: I want the largest batch size possible in terms of my model, which will fit into my GPU memory and won't crash the program.

asked Oct 09 '17 20:10 by Andrzej Gis


1 Answer

From the recent Deep Learning book by Goodfellow et al., chapter 8:

Minibatch sizes are generally driven by the following factors:

  • Larger batches provide a more accurate estimate of the gradient, but with less than linear returns.
  • Multicore architectures are usually underutilized by extremely small batches. This motivates using some absolute minimum batch size, below which there is no reduction in the time to process a minibatch.
  • If all examples in the batch are to be processed in parallel (as is typically the case), then the amount of memory scales with the batch size. For many hardware setups this is the limiting factor in batch size.
  • Some kinds of hardware achieve better runtime with specific sizes of arrays. Especially when using GPUs, it is common for power of 2 batch sizes to offer better runtime. Typical power of 2 batch sizes range from 32 to 256, with 16 sometimes being attempted for large models.
  • Small batches can offer a regularizing effect (Wilson and Martinez, 2003), perhaps due to the noise they add to the learning process. Generalization error is often best for a batch size of 1. Training with such a small batch size might require a small learning rate to maintain stability because of the high variance in the estimate of the gradient. The total runtime can be very high as a result of the need to make more steps, both because of the reduced learning rate and because it takes more steps to observe the entire training set.

Which in practice usually means "in powers of 2 and the larger the better, provided that the batch fits into your (GPU) memory".
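To make the "fits into memory" part concrete, here is a minimal sketch, not a definitive recipe. It assumes float32 tensors (4 bytes per element) and uses a hypothetical `fits` callback that you would implement yourself: run one training step at the given batch size and return False if the framework reports an out-of-memory error (in TensorFlow this surfaces as `tf.errors.ResourceExhaustedError`). The function names and the 16 MiB budget in the toy example are illustrative only.

```python
from typing import Callable, Sequence


def tensor_bytes(shape: Sequence[int], bytes_per_element: int = 4) -> int:
    """Memory for one dense tensor; 4 bytes per element assumes float32."""
    total = bytes_per_element
    for dim in shape:
        total *= dim
    return total


def largest_power_of_two_batch(fits: Callable[[int], bool],
                               max_batch: int = 1024) -> int:
    """Largest power-of-2 batch size <= max_batch for which fits() is True.

    `fits` is a hypothetical callback: it should try one training step at
    the given batch size and return False on an out-of-memory error.
    """
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    # Start from the largest power of 2 not exceeding max_batch, then halve.
    batch = 1 << (max_batch.bit_length() - 1)
    while batch >= 1:
        if fits(batch):
            return batch
        batch //= 2
    raise MemoryError("even a batch size of 1 does not fit")


if __name__ == "__main__":
    # The tensor from the question: (1024, 100, 160), assumed float32.
    print(tensor_bytes((1024, 100, 160)))  # 65536000 bytes, i.e. 62.5 MiB

    # Toy stand-in for a real training step: pretend each sample costs its
    # share of that one tensor and that 16 MiB are available.
    per_sample = tensor_bytes((1, 100, 160))
    budget = 16 * 1024 * 1024
    print(largest_power_of_two_batch(lambda b: b * per_sample <= budget))  # 256
```

A single tensor is only one of many allocations made during training, so a trial step is usually a more reliable test than arithmetic alone. It may also be wise to settle one step below the largest size that passes, to leave some headroom.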

You might also want to consult several good posts here on Stack Exchange:

  • Tradeoff batch size vs. number of iterations to train a neural network
  • Selection of Mini-batch Size for Neural Network Regression
  • How large should the batch size be for stochastic gradient descent?

Just keep in mind that the paper by Keskar et al., 'On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima', quoted by several of the posts above, has drawn some objections from other respected researchers in the deep learning community.

Hope this helps...

UPDATE (Dec 2017):

There is a new paper by Yoshua Bengio & team, Three Factors Influencing Minima in SGD (Nov 2017); it is worth reading because it reports new theoretical and experimental results on the interplay between learning rate and batch size.

UPDATE (Mar 2021):

Also of interest here is another paper from 2018, Revisiting Small Batch Training for Deep Neural Networks (h/t to Nicolas Gervais), which runs contrary to the "larger the better" advice; quoting from the abstract:

The best performance has been consistently obtained for mini-batch sizes between m=2 and m=32, which contrasts with recent work advocating the use of mini-batch sizes in the thousands.

answered Oct 03 '22 16:10 by desertnaut