What is the batchSize in TensorFlow's model.fit() function?

After defining a model with TensorFlow.js, you can run model.fit() to train it. This function takes a number of parameters, including a configuration object with a batchSize property. The documentation on model.fit() just says:

Number of samples per gradient update. If unspecified, it will default to 32.

While this is probably a technically correct answer, it doesn't really help. Why should I change this number? I have realized that if I increase it, training gets faster, and if I decrease it, it gets slower. But what exactly am I changing here? Why would I change it? What do I need to watch out for?

Any hints on this?
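
For context, here is where the option goes, in a minimal sketch with a toy model and synthetic data (the shapes are made up, purely illustrative):

```js
import * as tf from '@tensorflow/tfjs';

// Toy model and synthetic data, purely to show where batchSize is passed.
const model = tf.sequential({
  layers: [tf.layers.dense({units: 1, inputShape: [4]})],
});
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

const xs = tf.randomNormal([1024, 4]); // 1024 samples, 4 features each
const ys = tf.randomNormal([1024, 1]);

// batchSize: samples per gradient update; defaults to 32 if omitted.
// (Run inside an async function if top-level await is unavailable.)
await model.fit(xs, ys, {epochs: 3, batchSize: 64});
```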

asked Apr 04 '20 by Golo Roden


People also ask

What is batch size in model fit?

The batch size is the number of samples processed before the model is updated. The number of epochs is the number of complete passes through the training dataset. The size of a batch must be greater than or equal to one and less than or equal to the number of samples in the training dataset.
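
As a concrete, made-up example of how these two numbers interact (the batch size determines how many gradient updates one epoch performs):

```js
// Hypothetical numbers, for illustration only.
const numSamples = 50000; // training set size
const batchSize  = 32;    // samples per gradient update
const epochs     = 10;    // complete passes through the dataset

const updatesPerEpoch = Math.ceil(numSamples / batchSize); // 1563
const totalUpdates    = updatesPerEpoch * epochs;          // 15630
console.log({updatesPerEpoch, totalUpdates});
```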

What is the default batch size in the fit function of keras?

Number of samples per batch. If unspecified, batch_size will default to 32.

What does model fit do in TensorFlow?

fit() is for training the model with the given inputs (and corresponding training labels). evaluate() is for evaluating the already trained model using the validation (or test) data and the corresponding labels. Returns the loss value and metrics values for the model.
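
The same split exists in TensorFlow.js. A minimal sketch (with synthetic tensors standing in for real train/test splits):

```js
import * as tf from '@tensorflow/tfjs';

const model = tf.sequential({
  layers: [tf.layers.dense({units: 1, inputShape: [4]})],
});
model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

// Synthetic tensors standing in for real train/test splits.
const trainXs = tf.randomNormal([256, 4]);
const trainYs = tf.randomNormal([256, 1]);
const testXs  = tf.randomNormal([64, 4]);
const testYs  = tf.randomNormal([64, 1]);

// fit() updates the weights on the training data...
await model.fit(trainXs, trainYs, {epochs: 3, batchSize: 32});

// ...while evaluate() only measures the loss on held-out data (no updates).
const testLoss = model.evaluate(testXs, testYs);
testLoss.print();
```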

What is model fit () in Python?

model.fit(): fits the model to training data. For supervised learning applications, this accepts two arguments: the data X and the labels y (e.g. model.fit(X, y)). For unsupervised learning applications, this accepts only a single argument, the data X (e.g. model.fit(X)).


1 Answer

The batch size is the number of training examples that you use to perform one step of stochastic gradient descent (SGD).

What is SGD? SGD is gradient descent (GD), but rather than using all of your training data to compute the gradient of the loss function with respect to the parameters of the network, you use only a subset of the training dataset. Hence the adjective "stochastic": by using only a subset of the training data, you stochastically approximate (i.e. introduce noise into) the gradient that would be computed from all of your training data, which would be considered the "actual" gradient of the loss function with respect to the parameters.
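
In symbols (a standard textbook formulation, not quoted from the TensorFlow.js docs): one SGD step with parameters θ, learning rate η, and a mini-batch B of size batchSize is

```latex
% One mini-batch SGD step: average the per-example gradients over the
% mini-batch B (with |B| = batchSize), then step against that average.
\theta \leftarrow \theta - \eta \cdot \frac{1}{|B|}
  \sum_{(x_i,\, y_i) \in B} \nabla_\theta \, \ell\bigl(f_\theta(x_i),\, y_i\bigr)
```

With |B| equal to the whole training set, this reduces to ordinary (full-batch) gradient descent; with |B| = 1, it is the noisiest variant.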

Why should I change this number? I have realized that if I increase it, training gets faster, and if I decrease it, it gets slower. But what exactly am I changing here? Why would I change it? What do I need to watch out for?

If the batch size is too small, e.g. 1, then you compute the gradient from only one training example. This can make your training loss oscillate a lot, because each gradient is approximated from a single training example, which is often not representative of the whole training data. So, as a rule of thumb, the more training examples you use, the better you approximate the "true" gradient (the one that would be computed from all training examples), which can potentially lead to faster convergence. However, in practice, using many training examples per step is also computationally expensive. For example, imagine your training data consists of millions of training examples. In that case, a single step of (full-batch) gradient descent would have to process all of them, which can take a lot of time, so you would wait a long time just to see one update of your model's parameters. This may not be ideal.

To conclude, small batch sizes can make your training process oscillate, and this can make your loss function take a lot of time to reach a local minimum. However, a huge batch size may also be undesirable, because then every single parameter update becomes expensive to compute.

Typical values of the batch size are 32, 64, and 128. Why? People just use these numbers because they empirically seem to be good compromises (in terms of convergence, training time, etc.) between tiny batch sizes and huge batch sizes.
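
If you want to see the trade-off for yourself, one rough approach (a sketch, not a rigorous benchmark; the architecture and hyperparameters are made up) is to train the same model with each candidate size and compare final loss and wall-clock time:

```js
import * as tf from '@tensorflow/tfjs';

async function compareBatchSizes(xs, ys) {
  for (const batchSize of [1, 32, 128, 1024]) {
    // Fresh model per run, so every batch size starts from scratch.
    const model = tf.sequential({
      layers: [
        tf.layers.dense({units: 16, activation: 'relu', inputShape: [4]}),
        tf.layers.dense({units: 1}),
      ],
    });
    model.compile({optimizer: 'sgd', loss: 'meanSquaredError'});

    const start = Date.now();
    const history = await model.fit(xs, ys, {epochs: 5, batchSize});
    const seconds = (Date.now() - start) / 1000;

    const losses = history.history.loss; // per-epoch training loss values
    console.log(`batchSize=${batchSize}: ` +
      `final loss=${losses[losses.length - 1].toFixed(4)}, ${seconds.toFixed(1)}s`);
  }
}

compareBatchSizes(tf.randomNormal([4096, 4]), tf.randomNormal([4096, 1]));
```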

answered Sep 23 '22 by nbro