Generally speaking, the rule-of-thumb for splitting data is 80/20 - where 80% of the data is used for training a model, while 20% is used for testing it. This depends on the dataset you're working with, but an 80/20 split is very common and would get you through most datasets just fine.
Generally, the training and validation data set is split into an 80:20 ratio. Thus, 20% of the data is set aside for validation purposes. The ratio changes based on the size of the data.
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimum split of the test, validation, and train set depends upon factors such as the use case, the structure of the model, dimension of the data, etc.
All Answers (39) Follow 70/30 rule. 70% for training and 30% for validation.
There are two competing concerns: with less training data, your parameter estimates have greater variance. With less testing data, your performance statistic will have greater variance. Broadly speaking you should be concerned with dividing data such that neither variance is too high, which is more to do with the absolute number of instances in each category rather than the percentage.
If you have a total of 100 instances, you're probably stuck with cross validation as no single split is going to give you satisfactory variance in your estimates. If you have 100,000 instances, it doesn't really matter whether you choose an 80:20 split or a 90:10 split (indeed you may choose to use less training data if your method is particularly computationally intensive).
Assuming you have enough data to do proper held-out test data (rather than cross-validation), the following is an instructive way to get a handle on variances:
You'd be surprised to find out that 80/20 is quite a commonly occurring ratio, often referred to as the Pareto principle. It's usually a safe bet if you use that ratio.
However, depending on the training/validation methodology you employ, the ratio may change. For example: if you use 10-fold cross validation, then you would end up with a validation set of 10% at each fold.
There has been some research into what is the proper ratio between the training set and the validation set:
The fraction of patterns reserved for the validation set should be inversely proportional to the square root of the number of free adjustable parameters.
In their conclusion they specify a formula:
Validation set (v) to training set (t) size ratio, v/t, scales like ln(N/h-max), where N is the number of families of recognizers and h-max is the largest complexity of those families.
What they mean by complexity is:
Each family of recognizer is characterized by its complexity, which may or may not be related to the VC-dimension, the description length, the number of adjustable parameters, or other measures of complexity.
Taking the first rule of thumb (i.e.validation set should be inversely proportional to the square root of the number of free adjustable parameters), you can conclude that if you have 32 adjustable parameters, the square root of 32 is ~5.65, the fraction should be 1/5.65 or 0.177 (v/t). Roughly 17.7% should be reserved for validation and 82.3% for training.
Last year, I took Prof: Andrew Ng’s online machine learning course. His recommendation was:
Training: 60%
Cross validation: 20%
Testing: 20%
Well, you should think about one more thing.
If you have a really big dataset, like 1,000,000 examples, split 80/10/10 may be unnecessary, because 10% = 100,000 examples may be just too much for just saying that model works fine.
Maybe 99/0.5/0.5 is enough because 5,000 examples can represent most of the variance in your data and you can easily tell that model works good based on these 5,000 examples in test and dev.
Don't use 80/20 just because you've heard it's ok. Think about the purpose of the test set.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With