Is there a rule-of-thumb for how to divide a dataset into training and validation sets? [closed]

People also ask

Is there a rule of thumb for how to divide a dataset into training and validation sets?

Generally speaking, the rule-of-thumb for splitting data is 80/20 - where 80% of the data is used for training a model, while 20% is used for testing it. This depends on the dataset you're working with, but an 80/20 split is very common and would get you through most datasets just fine.

How do you split data for training and validation?

Generally, the training and validation data sets are split in an 80:20 ratio, so 20% of the data is set aside for validation purposes. The ratio changes based on the size of the data.

What is the best training and validation split percentage?

In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimum split of the test, validation, and train set depends upon factors such as the use case, the structure of the model, dimension of the data, etc.

Is there an ideal ratio between a training set and validation set?

Follow the 70/30 rule: 70% for training and 30% for validation.


There are two competing concerns: with less training data, your parameter estimates have greater variance; with less testing data, your performance statistic will have greater variance. Broadly speaking, you should divide the data such that neither variance is too high, which is more a matter of the absolute number of instances in each category than of the percentage.

If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
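If you do land in the cross-validation regime, a minimal sketch of what that looks like, assuming scikit-learn, a logistic-regression classifier, and a synthetic placeholder dataset (none of which come from the answer itself):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Placeholder dataset: only 100 instances, so no single split gives a
# low-variance estimate; cross-validation instead reuses every instance
# for both training and testing across folds.
X, y = make_classification(n_samples=100, n_features=20, random_state=0)

# 10-fold cross-validation: mean accuracy plus its spread across folds.
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=10)
print(scores.mean(), scores.std())
```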

Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:

  1. Split your data into training and testing (80/20 is indeed a good starting point).
  2. Split the training data into training and validation (again, 80/20 is a fair split).
  3. Subsample random selections of your training data, train the classifier with this, and record the performance on the validation set.
  4. Try a series of runs with different amounts of training data: randomly sample 20% of it, say, 10 times and observe performance on the validation data, then do the same with 40%, 60%, 80%. You should see both higher performance with more data and lower variance across the different random samples.
  5. To get a handle on variance due to the size of test data, perform the same procedure in reverse. Train on all of your training data, then randomly sample a percentage of your validation data a number of times and observe performance. You should find that the mean performance on small samples of your validation data is roughly the same as the performance on all the validation data, but the variance is much higher with smaller numbers of test samples. (A sketch of this whole procedure follows the list.)
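A runnable sketch of the procedure above, assuming scikit-learn, a logistic-regression classifier, and synthetic placeholder data (none of these choices come from the answer; they just make the experiment self-contained):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.RandomState(0)

# Placeholder data: 10,000 synthetic instances standing in for a real dataset.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

# Steps 1-2: 80/20 train/test, then 80/20 train/validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=0)

# Steps 3-4: train on random subsamples of the training data and watch how
# mean validation performance rises and its variance shrinks as the fraction grows.
for frac in (0.2, 0.4, 0.6, 0.8):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_train), size=int(frac * len(X_train)), replace=False)
        clf = LogisticRegression(max_iter=1000).fit(X_train[idx], y_train[idx])
        scores.append(accuracy_score(y_val, clf.predict(X_val)))
    print(f"train frac {frac:.0%}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")

# Step 5: the reverse. Train once on all training data, then score on random
# subsamples of the validation set; the mean stays roughly constant while the
# variance grows as the subsample shrinks.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
for frac in (0.1, 0.25, 0.5, 1.0):
    scores = []
    for _ in range(10):
        idx = rng.choice(len(X_val), size=int(frac * len(X_val)), replace=False)
        scores.append(accuracy_score(y_val[idx], clf.predict(X_val[idx])))
    print(f"val frac {frac:.0%}: mean={np.mean(scores):.3f} std={np.std(scores):.3f}")
```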

You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.

However, depending on the training/validation methodology you employ, the ratio may change. For example: if you use 10-fold cross validation, then you would end up with a validation set of 10% at each fold.
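For instance, a quick way to see that 10% hold-out per fold, assuming scikit-learn's KFold and placeholder data (not part of the original answer):

```python
import numpy as np
from sklearn.model_selection import KFold

X = np.arange(1000).reshape(-1, 1)  # 1,000 placeholder instances

# 10-fold cross-validation: each fold holds out 10% of the data as the validation set.
for train_idx, val_idx in KFold(n_splits=10, shuffle=True, random_state=0).split(X):
    print(len(train_idx), len(val_idx))  # 900 100 on every fold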

There has been some research into the proper ratio between the training set and the validation set:

The fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters.

In their conclusion they specify a formula:

Validation set (v) to training set (t) size ratio, v/t, scales like ln(N/h-max), where N is the number of families of recognizers and h-max is the largest complexity of those families.

What they mean by complexity is:

Each family of recognizer is characterized by its complexity, which may or may not be related to the VC-dimension, the description length, the number of adjustable parameters, or other measures of complexity.

Taking the first rule of thumb (i.e. that the fraction of patterns reserved for validation should be inversely proportional to the square root of the number of free adjustable parameters): if you have 32 adjustable parameters, the square root of 32 is ~5.66, so the fraction should be 1/5.66 ≈ 0.177. Roughly 17.7% of the data should be reserved for validation and 82.3% for training.
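A tiny sketch of that arithmetic, with a hypothetical helper name (validation_fraction) and the assumption that "fraction" means the share of the total data reserved for validation:

```python
import math

def validation_fraction(num_free_parameters: int) -> float:
    """Rule of thumb: 1 / sqrt(number of free adjustable parameters)."""
    return 1.0 / math.sqrt(num_free_parameters)

frac = validation_fraction(32)
print(f"validation: {frac:.1%}, training: {1 - frac:.1%}")  # validation: 17.7%, training: 82.3%
```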


Last year, I took Prof. Andrew Ng’s online machine learning course. His recommendation was:

Training: 60%

Cross validation: 20%

Testing: 20%
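A minimal sketch of producing such a 60/20/20 split, assuming scikit-learn's train_test_split applied twice and synthetic placeholder data (neither is from the course or the answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data: 1,000 examples with 10 features and binary labels.
X = np.random.rand(1000, 10)
y = np.random.randint(0, 2, size=1000)

# First carve off 20% as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=0.20, random_state=0)

# Then take 25% of the remaining 80% as the cross-validation set,
# which is 20% of the original data, leaving 60% for training.
X_train, X_cv, y_train, y_cv = train_test_split(X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train), len(X_cv), len(X_test))  # 600 200 200
```

Splitting in two stages keeps the test set untouched until the very end, which is the point of holding it out.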


Well, you should think about one more thing.

If you have a really big dataset, like 1,000,000 examples, an 80/10/10 split may be unnecessary, because 10% = 100,000 examples is more than you need just to establish that the model works fine.

Maybe 99/0.5/0.5 is enough, because 5,000 examples can represent most of the variance in your data, and you can easily tell that the model works well based on those 5,000 examples in the test and dev sets.

Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
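One way to act on that with a large dataset is to fix an absolute dev/test size rather than a percentage; a minimal sketch, assuming scikit-learn and synthetic placeholder data (the 5,000-example figure is just the one from this answer):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder "large" dataset: 1,000,000 examples.
X = np.random.rand(1_000_000, 5)
y = np.random.randint(0, 2, size=1_000_000)

# Fix the absolute size of the test and dev sets (5,000 each, i.e. 0.5%),
# instead of handing over a fixed percentage of the data.
X_rest, X_test, y_rest, y_test = train_test_split(X, y, test_size=5_000, random_state=0)
X_train, X_dev, y_train, y_dev = train_test_split(X_rest, y_rest, test_size=5_000, random_state=0)

print(len(X_train), len(X_dev), len(X_test))  # 990000 5000 5000
```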