I read somewhere that if you choose a batch size that is a power of 2, training will be faster. Where does this rule come from? Does it apply to other applications as well? Can you provide a reference paper?
As we have seen, using powers of 2 for the batch size offers no clear advantage in everyday training situations, which leads to the conclusion: when choosing a batch size, measuring the actual effect on training speed, accuracy, and memory consumption should be preferred over focusing on powers of 2.
Moreover, by using bigger batch sizes (up to a reasonable amount allowed by the GPU), we speed up training, as it amounts to taking a few big steps instead of many little steps. Therefore, with bigger batch sizes, for the same number of epochs, we can sometimes see a 2x speedup in computation time! A simple way to check this is to benchmark directly, as in the sketch below.
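Here is a minimal timing sketch for such a measurement, assuming PyTorch; the toy model, the synthetic data, and the particular batch sizes are all hypothetical placeholders, to be swapped for your own setup:

```python
import time
import torch
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical toy setup; replace with your own model and data.
device = "cuda" if torch.cuda.is_available() else "cpu"
X = torch.randn(50_000, 128)
y = torch.randint(0, 10, (50_000,))
dataset = TensorDataset(X, y)
model = torch.nn.Linear(128, 10).to(device)
loss_fn = torch.nn.CrossEntropyLoss()

# Compare powers of two against their neighbours.
for batch_size in (63, 64, 65, 127, 128, 129, 255, 256, 257):
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    if device == "cuda":
        torch.cuda.synchronize()
    start = time.perf_counter()
    for xb, yb in loader:  # one epoch
        xb, yb = xb.to(device), yb.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)
        loss.backward()
        optimizer.step()
    if device == "cuda":
        torch.cuda.synchronize()
    print(f"batch_size={batch_size}: {time.perf_counter() - start:.3f}s per epoch")
```

In practice the differences (if any) depend heavily on the model, the hardware, and the framework, which is exactly why measuring is preferable to following the power-of-2 rule blindly.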
The overall idea is to fit your mini-batch entirely into CPU/GPU memory. Since CPU and GPU memory capacities come in powers of two, it is advised to keep the mini-batch size a power of two.
Batch size is among the important hyperparameters in machine learning. It defines the number of samples to work through before the internal model parameters are updated. Choosing it well can be one of the crucial steps to making sure your models hit peak performance.
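To make that definition concrete, here is a minimal NumPy sketch (toy linear regression with made-up data) in which the parameters are updated once per mini-batch, so `batch_size` controls how many samples are consumed between updates:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))                                   # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 3.0, 0.0]) + 0.1 * rng.normal(size=1000)

w = np.zeros(5)          # model parameters
batch_size, lr = 32, 0.1

for epoch in range(5):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]                     # one mini-batch
        grad = 2 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)   # mean-squared-error gradient
        w -= lr * grad                                           # one update per mini-batch
print(w)
```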
The notion comes from aligning computations (C) onto the physical processors (PP) of the GPU. Since the number of PP is often a power of 2, using a number of C that is not a power of 2 leads to poor performance.

You can think of the mapping of the C onto the PP as a pile of slices, each of size equal to the number of PP. Say you have 16 PP. You can map 16 C onto them: 1 C is mapped onto 1 PP. You can map 32 C onto them: 2 slices of 16 C, and each PP is responsible for 2 C.

This is due to the SIMD paradigm used by GPUs. This is often called data parallelism: all the PP do the same thing at the same time, but on different data.
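A toy illustration of that slicing argument follows; the numbers are hypothetical, and real GPU scheduling is considerably more complex (warps, streaming multiprocessors, etc.), so treat this purely as a sketch of the counting:

```python
import math

PP = 16  # hypothetical number of physical processors

for C in (16, 17, 32, 33, 48, 60, 64):
    slices = math.ceil(C / PP)           # "slices" of work, each of size PP
    used_in_last = C - (slices - 1) * PP # processors doing useful work in the last slice
    utilization = C / (slices * PP)      # fraction of processor slots doing useful work
    print(f"C={C:3d}: {slices} slice(s), last slice uses {used_in_last:2d}/{PP} PP, "
          f"utilization={utilization:.0%}")
```

With C = 17 and 16 PP, for example, the second slice keeps only one processor busy, which is where the claimed inefficiency of non-power-of-2 sizes comes from.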
Algorithmically speaking, using larger mini-batches allows you to reduce the variance of your stochastic gradient updates (by taking the average of the gradients in the mini-batch), and this in turn allows you to take bigger step-sizes, which means the optimization algorithm will make progress faster.
However, the amount of work done (in terms of the number of gradient computations) to reach a certain accuracy in the objective will be the same: with a mini-batch size of n, the variance of the update direction is reduced by a factor of n, so the theory allows you to take step-sizes that are n times larger. A single step then takes you roughly as far as n steps of SGD with a mini-batch size of 1.
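A quick empirical check of that variance argument, using synthetic per-sample "gradients" (just noisy scalars here, for illustration): averaging over a mini-batch of size n reduces the variance of the update direction by roughly a factor of n.

```python
import numpy as np

rng = np.random.default_rng(0)
true_grad, noise_std = 1.0, 2.0
# Synthetic per-sample gradient estimates: true value plus noise.
samples = true_grad + noise_std * rng.normal(size=100_000)

for n in (1, 4, 16, 64):
    # Average gradients within mini-batches of size n.
    batch_means = samples[: (len(samples) // n) * n].reshape(-1, n).mean(axis=1)
    print(f"n={n:3d}: variance of mini-batch gradient ~= {batch_means.var():.4f} "
          f"(theory: {noise_std**2 / n:.4f})")
```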
As for TensorFlow, I found no evidence for your claim, and it's a question that has been closed on GitHub: https://github.com/tensorflow/tensorflow/issues/4132
Note that resizing images to powers of two makes sense (because pooling is generally done in 2×2 windows), but that's a different thing altogether.
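To see why, note that each 2×2 pooling halves the spatial size, so power-of-two image sizes divide evenly through several pooling stages while other sizes get rounded down along the way. A toy calculation (not tied to any particular framework, which may pad instead of flooring):

```python
def pooled_sizes(size, num_pools=4):
    """Spatial size after successive 2x2 poolings, using floor division."""
    sizes = [size]
    for _ in range(num_pools):
        sizes.append(sizes[-1] // 2)
    return sizes

for size in (255, 256, 300):
    print(size, "->", pooled_sizes(size))
```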