I read somewhere that if you choose a batch size that is a power of 2, training will be faster. Where does this rule come from? Does it apply to other applications as well? Can you provide a reference paper?
As we have seen, using powers of 2 for the batch size offers no clear advantage in everyday training situations, which leads to the conclusion: when choosing a batch size, measuring the actual effect on training speed, accuracy, and memory consumption should be preferred over focusing on powers of 2.
Moreover, by using bigger batch sizes (up to a reasonable amount allowed by the GPU), we speed up training, as it amounts to taking a few big steps instead of many little steps. Therefore, with bigger batch sizes, for the same number of epochs, we can sometimes see a 2x speedup in computation time! A simple way to check this is to benchmark directly, as in the sketch below.
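Here is a minimal timing sketch for such a measurement, assuming PyTorch; the toy model, the synthetic data, and the particular batch sizes are all hypothetical placeholders, to be swapped for your own setup:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy setup; replace with your own model and data.
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(50_000, 128)
y = torch.randint(0, 10, (50_000,))
dataset = TensorDataset(X, y)
model = torch.nn.Linear(128, 10).to(device)
loss_fn = torch.nn.CrossEntropyLoss()

# Compare powers of two against their neighbours.
for batch_size in (63, 64, 65, 127, 128, 129, 255, 256, 257):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for xb, yb in loader:  # one epoch
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.3f}s per epoch")
```

In practice the differences (if any) depend heavily on the model, the hardware, and the framework, which is exactly why measuring is preferable to following the power-of-2 rule blindly.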
The overall idea is to fit your mini-batch entirely into CPU/GPU memory. Since CPU and GPU memory capacities come in powers of two, it is advised to keep the mini-batch size a power of two.
Batch size is among the important hyperparameters in machine learning. It defines the number of samples to work through before the internal model parameters are updated. Choosing it well can be one of the crucial steps to making sure your models hit peak performance.
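To make that definition concrete, here is a minimal NumPy sketch (toy linear regression with made-up data) in which the parameters are updated once per mini-batch, so `batch_size` controls how many samples are consumed between updates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)          # model parameters
batch_size, lr = 32, 0.1

for epoch in range(5):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                     # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # mean-squared-error gradient
        w -= lr * grad                                           # one update per mini-batch
print(w)
```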
The notion comes from aligning computations (C) onto the physical processors (PP) of the GPU. Since the number of PP is often a power of 2, using a number of C that is not a power of 2 leads to poor performance.

You can think of the mapping of the C onto the PP as a pile of slices, each of size equal to the number of PP. Say you have 16 PP. You can map 16 C onto them: 1 C is mapped onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, and each PP is responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time, but on different data.
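A toy illustration of that slicing argument follows; the numbers are hypothetical, and real GPU scheduling is considerably more complex (warps, streaming multiprocessors, etc.), so treat this purely as a sketch of the counting:

```python
import math

PP = 16  # hypothetical number of physical processors

for C in (16, 17, 32, 33, 48, 60, 64):
    slices = math.ceil(C / PP)           # "slices" of work, each of size PP
    used_in_last = C - (slices - 1) * PP # processors doing useful work in the last slice
    utilization = C / (slices * PP)      # fraction of processor slots doing useful work
    print(f"C={C:3d}: {slices} slice(s), last slice uses {used_in_last:2d}/{PP} PP, "
          f"utilization={utilization:.0%}")
```

With C = 17 and 16 PP, for example, the second slice keeps only one processor busy, which is where the claimed inefficiency of non-power-of-2 sizes comes from.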
Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.
However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step-sizes that are n times larger. A single step then takes you roughly as far as n steps of SGD with a mini-batch size of 1.
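A quick empirical check of that variance argument, using synthetic per-sample "gradients" (just noisy scalars here, for illustration): averaging over a mini-batch of size n reduces the variance of the update direction by roughly a factor of n.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, noise_std = 1.0, 2.0
# Synthetic per-sample gradient estimates: true value plus noise.
samples = true_grad + noise_std * rng.normal(size=100_000)

for n in (1, 4, 16, 64):
    # Average gradients within mini-batches of size n.
    batch_means = samples[: (len(samples) // n) * n].reshape(-1, n).mean(axis=1)
    print(f"n={n:3d}: variance of mini-batch gradient ~= {batch_means.var():.4f} "
          f"(theory: {noise_std**2 / n:.4f})")
```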
As for TensorFlow, I found no evidence for your claim, and it's a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132
Note that resizing images to powers of two makes sense (because pooling is generally done in 2×2 windows), but that's a different thing altogether.
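To see why, note that each 2×2 pooling halves the spatial size, so power-of-two image sizes divide evenly through several pooling stages while other sizes get rounded down along the way. A toy calculation (not tied to any particular framework, which may pad instead of flooring):

```python
def pooled_sizes(size, num_pools=4):
    """Spatial size after successive 2x2 poolings, using floor division."""
    sizes = [size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // 2)
    return sizes

for size in (255, 256, 300):
    print(size, "->", pooled_sizes(size))
```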