 

Is using batch size as 'powers of 2' faster on tensorflow?

I read somewhere that if you choose a batch size that is a power of 2, training will be faster. What is this rule? Is it applicable to other applications? Can you provide a reference paper?

asked Jun 11 '17 by Chaine


People also ask

Should batch sizes be powers of 2?

As we have seen, using powers of 2 for the batch size is not readily advantageous in everyday training situations, which leads to the conclusion: Measuring the actual effect on training speed, accuracy and memory consumption when choosing a batch size should be preferred instead of focusing on powers of 2.
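For instance, a minimal timing sketch along these lines (the toy model, data shapes and batch sizes below are made up purely for illustration) lets you compare powers of 2 against their immediate neighbours directly:

```python
import time
import tensorflow as tf

# Toy data and model, purely illustrative and not tuned for anything.
x = tf.random.normal((8192, 128))
y = tf.random.uniform((8192,), maxval=10, dtype=tf.int32)

def time_batch_size(batch_size, epochs=3):
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(256, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )
    start = time.perf_counter()
    model.fit(x, y, batch_size=batch_size, epochs=epochs, verbose=0)
    return time.perf_counter() - start

# Compare powers of two against nearby non-powers of two.
for bs in (63, 64, 65, 127, 128, 129):
    print(bs, round(time_batch_size(bs), 2), "s")
```

In line with the quote above, whatever differences such a measurement shows on your hardware are a better guide than the power-of-2 rule of thumb itself.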

Does increasing batch size increase speed?

Moreover, by using bigger batch sizes (up to a reasonable amount that is allowed by the GPU), we speed up training, as it is equivalent to taking a few big steps, instead of taking many little steps. Therefore with bigger batch sizes, for the same amount of epochs, we can sometimes have a 2x gain in computational time!

Why do we use multiples of 2 for mini batch size?

The overall idea is to fit your mini-batch entirely in CPU/GPU memory. Since CPU/GPU memory capacities come in powers of two, it is advised to keep the mini-batch size a power of two.
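As a rough sanity check on whether a mini-batch fits, you can estimate its input size in memory (the 224×224×3 float32 shape below is an arbitrary example, not tied to any particular model):

```python
# Rough input-memory estimate for one mini-batch of float32 data.
# The 224x224x3 feature shape is an arbitrary example.
BYTES_PER_FLOAT32 = 4

def batch_input_bytes(batch_size, feature_shape=(224, 224, 3)):
    n_values = batch_size
    for dim in feature_shape:
        n_values *= dim
    return n_values * BYTES_PER_FLOAT32

for bs in (32, 64, 128, 256):
    print(f"batch size {bs:>3}: {batch_input_bytes(bs) / 2**20:.1f} MiB of inputs")
```

Note that this only counts the inputs; intermediate activations and gradients add to it, so treat the figure as a lower bound.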

Does batch size influence model performance?

Batch size is among the important hyperparameters in machine learning. It is the hyperparameter that defines the number of samples to work through before updating the internal model parameters, and choosing it well can be one of the crucial steps to making sure your models hit peak performance.


2 Answers

The notion comes from aligning computations (C) onto the physical processors (PP) of the GPU.

Since the number of PP is often a power of 2, using a number of C different from a power of 2 leads to poor performance.

You can see the mapping of the C onto the PP as a pile of slices, each slice the size of the number of PP. Say you've got 16 PP. You can map 16 C onto them: 1 C is mapped onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, with each PP responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time but on different data.
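As a back-of-the-envelope sketch of that picture (16 PP is just the example figure from above, not a property of any particular GPU), you can count how many slices a given number of computations needs and how well the last slice is filled:

```python
import math

def slices_needed(num_computations, num_processors=16):
    # Number of "slices" needed to run the computations,
    # assuming one computation per processor per slice.
    return math.ceil(num_computations / num_processors)

for c in (16, 17, 32, 33):
    s = slices_needed(c)
    print(f"{c} C on 16 PP -> {s} slice(s), overall utilisation {c / (s * 16):.0%}")
```

A batch that is one element past a multiple of the processor count costs a whole extra slice, which is the intuition behind preferring multiples (and hence powers) of 2.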

answered Oct 04 '22 by mrk


Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.

However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step-sizes that are n times larger, and a single step then takes you roughly to the same accuracy as n steps of SGD with a mini-batch size of 1.
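A quick numerical check of that variance argument (the per-example "gradients" are simulated here with unit-variance Gaussian noise, just to show the 1/n scaling):

```python
import numpy as np

rng = np.random.default_rng(0)
true_gradient = 1.0

def update_variance(batch_size, trials=100_000):
    # Per-example gradient = true gradient + unit-variance noise;
    # one mini-batch update averages batch_size of them.
    noise = rng.normal(size=(trials, batch_size))
    updates = true_gradient + noise.mean(axis=1)
    return updates.var()

for n in (1, 4, 16, 64):
    print(f"batch size {n:>2}: update variance ~ {update_variance(n):.4f} "
          f"(theory: 1/n = {1 / n:.4f})")
```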

As for TensorFlow, I found no evidence supporting your claim, and it's a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132

Note that resizing images to powers of two makes sense (because pooling is generally done in 2×2 windows), but that's a different thing altogether.
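For reference, a tiny sketch of why that is (the 64×64 input is an arbitrary power-of-two size):

```python
import tensorflow as tf

# A 2x2 max-pool halves each spatial dimension, so sides that are powers of
# two can be pooled repeatedly with nothing left over.
x = tf.zeros((1, 64, 64, 3))
for _ in range(3):
    x = tf.keras.layers.MaxPooling2D(pool_size=2)(x)
    print(x.shape)  # (1, 32, 32, 3) -> (1, 16, 16, 3) -> (1, 8, 8, 3)
```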

answered Oct 04 '22 by mxdbld