 

Training, Validation, Testing Batch Size Ratio

I'm doing transfer learning with Inception on TensorFlow, following this training script: https://raw.githubusercontent.com/tensorflow/hub/master/examples/image_retraining/retrain.py

At the bottom of the script, we can specify parameters for our dataset: the training, validation, and testing percentages, and the training, validation, and testing batch sizes.
Let's say I have a very large dataset (1 million images) and I've already set the training, validation, and testing percentages to 75:15:10.

But I have no idea how to set the batch-size parameters correctly:

  • train_batch_size
  • validation_batch_size
  • test_batch_size

For now, I've set train_batch_size to 64. Do I need to set the same value for validation_batch_size, or should it be bigger or smaller than train_batch_size?
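
For reference, a minimal sketch of how those percentages translate into split sizes (the variable names are illustrative, not retrain.py internals; retrain.py takes --testing_percentage and --validation_percentage and treats the remainder as the training share):

    # Illustrative only: how a 75:15:10 split works out for 1,000,000 images.
    dataset_size = 1_000_000
    validation_percentage = 15   # corresponds to --validation_percentage
    testing_percentage = 10      # corresponds to --testing_percentage
    # The training share is whatever remains:
    training_percentage = 100 - validation_percentage - testing_percentage  # 75

    n_train = dataset_size * training_percentage // 100   # 750,000 examples
    n_val = dataset_size * validation_percentage // 100   # 150,000 examples
    n_test = dataset_size * testing_percentage // 100     # 100,000 examples
    print(n_train, n_val, n_test)

If I recall correctly, retrain.py also accepts -1 for validation_batch_size and test_batch_size, meaning "use the whole set in one batch".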

asked Jan 29 '19 by gameon67

People also ask


What is a good batch size for training?

The batch size affects indicators such as overall training time, training time per epoch, and the quality of the model. Usually, we choose the batch size as a power of two, in the range between 16 and 512. Generally, 32 is a good rule of thumb and a good initial choice.
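
To make the speed side of that trade-off concrete, here's a small illustrative sketch (the training-set size is a stand-in) showing how the batch size changes the number of optimizer steps per epoch; larger batches mean fewer, heavier steps:

    import math

    n_train = 750_000  # hypothetical training-set size
    for batch_size in [2**k for k in range(4, 10)]:  # powers of two: 16 .. 512
        steps = math.ceil(n_train / batch_size)
        print(f"batch_size={batch_size:4d} -> {steps:6d} optimizer steps per epoch")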

What is a good ratio between training data and test data?

A commonly used ratio is 80:20, which means 80% of the data is for training and 20% for testing. Other ratios such as 70:30, 60:40, and even 50:50 are also used in practice.

What percentage of data should be training validation and test sets?

In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with.
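
As a minimal sketch, assuming scikit-learn is available and the features and labels are already loaded as arrays (the random data below is a stand-in), an 80/10/10 split can be produced by calling train_test_split twice:

    import numpy as np
    from sklearn.model_selection import train_test_split

    # Dummy features and labels standing in for a real dataset.
    X = np.random.rand(1000, 32)
    y = np.random.randint(0, 10, size=1000)

    # First carve off 20% of the data for validation + test combined.
    X_train, X_rest, y_train, y_rest = train_test_split(
        X, y, test_size=0.20, random_state=42)

    # Then split that 20% in half: 10% validation and 10% test overall.
    X_val, X_test, y_val, y_test = train_test_split(
        X_rest, y_rest, test_size=0.50, random_state=42)

    print(len(X_train), len(X_val), len(X_test))  # 800 100 100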


1 Answer

You can follow the advice from the other answers for the dataset split ratio. However, the batch size has absolutely nothing to do with how you've split your datasets.

The batch size determines how many examples are processed in parallel during training or inference. At training time, the batch size can affect how fast and how well your training converges; you can find a discussion of this effect here. Thus, for train_batch_size, it's worth picking a value that is neither too small nor too large. For some applications, using the largest possible training batches can actually be desirable, but in general you select it through experiments and validation.
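
A hedged sketch of that experiment-driven selection, using a toy Keras model on random data rather than the Inception setup: train the same architecture with several candidate batch sizes and compare the resulting validation accuracy.

    import numpy as np
    import tensorflow as tf

    # Toy data standing in for a real dataset.
    x_train = np.random.rand(2048, 32).astype("float32")
    y_train = np.random.randint(0, 10, size=2048)
    x_val = np.random.rand(512, 32).astype("float32")
    y_val = np.random.randint(0, 10, size=512)

    def make_model():
        return tf.keras.Sequential([
            tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
            tf.keras.layers.Dense(10, activation="softmax"),
        ])

    # Train the same model with each candidate batch size and compare.
    for batch_size in (16, 32, 64, 128):
        model = make_model()
        model.compile(optimizer="adam",
                      loss="sparse_categorical_crossentropy",
                      metrics=["accuracy"])
        history = model.fit(x_train, y_train,
                            batch_size=batch_size,
                            epochs=3,
                            validation_data=(x_val, y_val),
                            verbose=0)
        print(batch_size, history.history["val_accuracy"][-1])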

However, for validation_batch_size and test_batch_size, you should pick the largest batch size that your hardware can handle without running out of memory and crashing. Finding this is usually a simple trial and error process. The larger your batch size at inference time, the faster it will be, since more inputs can be processed in parallel.
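
One common way to script that trial-and-error search (a sketch with a toy model; tf.errors.ResourceExhaustedError is what TensorFlow raises on GPU out-of-memory): keep doubling the batch size until prediction fails, then keep the last size that worked.

    import numpy as np
    import tensorflow as tf

    # Toy model and data standing in for the real inference setup.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10, activation="softmax"),
    ])
    x = np.random.rand(100_000, 32).astype("float32")

    batch_size = 64
    largest_ok = None
    while batch_size <= len(x):
        try:
            model.predict(x[:batch_size], batch_size=batch_size, verbose=0)
            largest_ok = batch_size   # this size fit in memory
            batch_size *= 2           # try doubling it
        except tf.errors.ResourceExhaustedError:
            break                     # ran out of memory; stop here
    print("largest usable inference batch size:", largest_ok)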

EDIT: Here's an additional useful reference (p. 276) on the training batch size trade-off, from Goodfellow et al.'s Deep Learning book.

answered Nov 15 '22 by Proyag