What is the batchSize in TensorFlow's model.fit() function?

After defining a model with TensorFlow.js, you can run model.fit() to train it. This function takes a number of parameters, including a configuration object with a batchSize property. The documentation on model.fit() just says:

Number of samples per gradient update. If unspecified, it will default to 32.

While this is probably a technically correct answer, it doesn't really help. Why should I change this number? I have realized that if I increase it, training gets faster, and if I decrease it, it gets slower. But what exactly am I changing here? Why would I change it? What do I need to watch out for?

Any hints on this?
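
For context, here is where the option goes, in a minimal sketch with a toy model and synthetic data (the shapes are made up, purely illustrative):

```js
import * as tf from '@tensorflow/tfjs';

// Toy model and synthetic data, purely to show where batchSize is passed.
const model = tf.sequential({
  layers: [tf.layers.dense({units: 1, inputShape: [4]})],
});
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

const xs = tf.randomNormal([1024, 4]); // 1024 samples, 4 features each
const ys = tf.randomNormal([1024, 1]);

// batchSize: samples per gradient update; defaults to 32 if omitted.
// (Run inside an async function if top-level await is unavailable.)
await model.fit(xs, ys, {epochs: 3, batchSize: 64});
```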

asked Apr 04 '20 by Golo Roden


People also ask

What is batch size in model fit?

The batch size is the number of samples processed before the model is updated. The number of epochs is the number of complete passes through the training dataset. The size of a batch must be greater than or equal to one and less than or equal to the number of samples in the training dataset.
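
As a concrete, made-up example of how these two numbers interact (the batch size determines how many gradient updates one epoch performs):

```js
// Hypothetical numbers, for illustration only.
const numSamples = 50000; // training set size
const batchSize  = 32;    // samples per gradient update
const epochs     = 10;    // complete passes through the dataset

const updatesPerEpoch = Math.ceil(numSamples / batchSize); // 1563
const totalUpdates    = updatesPerEpoch * epochs;          // 15630
console.log({updatesPerEpoch, totalUpdates});
```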

What is the default batch size in the fit function of keras?

Number of samples per batch. If unspecified, batch_size will default to 32.

What does model fit do in TensorFlow?

fit() is for training the model with the given inputs (and corresponding training labels). evaluate() is for evaluating the already trained model using the validation (or test) data and the corresponding labels. Returns the loss value and metrics values for the model.
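
The same split exists in TensorFlow.js. A minimal sketch (with synthetic tensors standing in for real train/test splits):

```js
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential({
  layers: [tf.layers.dense({units: 1, inputShape: [4]})],
});
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

// Synthetic tensors standing in for real train/test splits.
const trainXs = tf.randomNormal([256, 4]);
const trainYs = tf.randomNormal([256, 1]);
const testXs  = tf.randomNormal([64, 4]);
const testYs  = tf.randomNormal([64, 1]);

// fit() updates the weights on the training data...
await model.fit(trainXs, trainYs, {epochs: 3, batchSize: 32});

// ...while evaluate() only measures the loss on held-out data (no updates).
const testLoss = model.evaluate(testXs, testYs);
testLoss.print();
```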

What is model fit () in Python?

model.fit(): fits the model to training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).


1 Answer

The batch size is the number of training examples that you use to perform one step of stochastic gradient descent (SGD).

What is SGD? SGD is gradient descent (GD), but rather than using all of your training data to compute the gradient of the loss function with respect to the parameters of the network, you use only a subset of the training dataset. Hence the adjective "stochastic": by using only a subset of the training data, you stochastically approximate (i.e. introduce noise into) the gradient that would be computed from all of your training data, which would be considered the "actual" gradient of the loss function with respect to the parameters.
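
In symbols (a standard textbook formulation, not quoted from the TensorFlow.js docs): one SGD step with parameters θ, learning rate η, and a mini-batch B of size batchSize is

```latex
% One mini-batch SGD step: average the per-example gradients over the
% mini-batch B (with |B| = batchSize), then step against that average.
\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|}
  \sum_{(x_i,\, y_i) \in B} \nabla_\theta \, \ell\bigl(f_\theta(x_i),\, y_i\bigr)
```

With |B| equal to the whole training set, this reduces to ordinary (full-batch) gradient descent; with |B| = 1, it is the noisiest variant.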

Why should I change this number? I have realized that if I increase it, training gets faster, and if I decrease it, it gets slower. But what exactly am I changing here? Why would I change it? What do I need to watch out for?

If the batch size is too small, e.g. 1, then you compute the gradient from only one training example. This can make your training loss oscillate a lot, because each gradient is approximated from a single training example, which is often not representative of the whole training data. So, as a rule of thumb, the more training examples you use, the better you approximate the "true" gradient (the one that would be computed from all training examples), which can potentially lead to faster convergence. However, in practice, using many training examples per step is also computationally expensive. For example, imagine your training data consists of millions of training examples. In that case, a single step of (full-batch) gradient descent would have to process all of them, which can take a lot of time, so you would wait a long time just to see one update of your model's parameters. This may not be ideal.

To conclude, small batch sizes can make your training process oscillate, and this can make your loss function take a lot of time to reach a local minimum. However, a huge batch size may also be undesirable, because then every single parameter update becomes expensive to compute.

Typical values of the batch size are 32, 64, and 128. Why? People just use these numbers because they empirically seem to be good compromises (in terms of convergence, training time, etc.) between tiny batch sizes and huge batch sizes.
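
If you want to see the trade-off for yourself, one rough approach (a sketch, not a rigorous benchmark; the architecture and hyperparameters are made up) is to train the same model with each candidate size and compare final loss and wall-clock time:

```js
import * as tf from '@tensorflow/tfjs';

async function compareBatchSizes(xs, ys) {
  for (const batchSize of [1, 32, 128, 1024]) {
    // Fresh model per run, so every batch size starts from scratch.
    const model = tf.sequential({
      layers: [
        tf.layers.dense({units: 16, activation: 'relu', inputShape: [4]}),
        tf.layers.dense({units: 1}),
      ],
    });
    model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

    const start = Date.now();
    const history = await model.fit(xs, ys, {epochs: 5, batchSize});
    const seconds = (Date.now() - start) / 1000;

    const losses = history.history.loss; // per-epoch training loss values
    console.log(`batchSize=${batchSize}: ` +
      `final loss=${losses[losses.length - 1].toFixed(4)}, ${seconds.toFixed(1)}s`);
  }
}

compareBatchSizes(tf.randomNormal([4096, 4]), tf.randomNormal([4096, 1]));
```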

answered Sep 23 '22 by nbro