
How can I know training data is enough for machine learning

For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method for measuring this?

asked Jul 15 '14 by tidy


People also ask

How much training data is enough?

The most common way to define whether a data set is sufficient is to apply a 10 times rule. This rule means that the amount of input data (i.e., the number of examples) should be ten times more than the number of degrees of freedom a model has.
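As a quick sanity check, the rule can be written down directly. This is only a sketch of the heuristic; the function name and the choice to count the bias term as a degree of freedom are illustrative:

```python
def min_training_samples(degrees_of_freedom, factor=10):
    """Heuristic lower bound on training-set size (the '10 times rule').

    degrees_of_freedom: number of free parameters in the model
    factor: the rule-of-thumb multiplier, conventionally 10
    """
    return factor * degrees_of_freedom

# e.g. a linear model with 20 features plus a bias term has 21 parameters:
print(min_training_samples(21))  # 210
```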

How much data do you need to train a machine learning model?

Your test set should be about 25% the size of your training set. So with a dataset that is expected to exhibit annual seasonality, the minimum number of points required to train and test multiple models is 365 + 365/4 ~ 456 observations.

What percentage of data should be training data?

In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimum split of the test, validation, and train set depends upon factors such as the use case, the structure of the model, dimension of the data, etc.
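A minimal sketch of such an 80/10/10 split using only the standard library; the function name and the fixed seed are illustrative choices, not a standard API:

```python
import random

def train_val_test_split(data, train_frac=0.8, val_frac=0.1, seed=0):
    """Shuffle a dataset and split it into train/validation/test subsets.

    Whatever remains after the train and validation fractions (here 10%)
    becomes the test set.
    """
    rng = random.Random(seed)
    shuffled = data[:]            # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train_frac)
    n_val = int(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

train, val, test = train_val_test_split(list(range(100)))
print(len(train), len(val), len(test))  # 80 10 10
```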

Do you need training data for machine learning?

Machine learning models depend on data. Without a foundation of high-quality training data, even the most performant algorithms can be rendered useless. Indeed, robust machine learning models can be crippled when they are trained on inadequate, inaccurate, or irrelevant data in the early stages.


3 Answers

It is not easy to know how many samples you need to collect. However, you can follow these steps:

For solving a typical ML problem:

  1. Build a dataset with a few samples. How many? That depends on the kind of problem you have; don't spend a lot of time on this now.
  2. Split your dataset into training, cross-validation, and test sets, and build your model.
  3. Now that you've built the ML model, evaluate how good it is: calculate your test error.
  4. If your test error is higher than you can accept, collect more data and repeat steps 1-3 until you reach a test error rate you are comfortable with.

This method will work as long as your model is not suffering from "high bias" (underfitting); in that case, adding more data will not lower the error.

This video from Coursera's Machine Learning course explains it.

answered Sep 20 '22 by jabaldonedo


Unfortunately, there is no simple method for this.

The rule of thumb is: the bigger, the better. In practice, though, you have to gather a sufficient amount of data, where by "sufficient" I mean covering as large a part of the modeled space as you consider acceptable.

Also, quantity is not everything. The quality of the training samples is very important too; for instance, the training set should not contain duplicates.
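One cheap duplicate check, sketched here under the assumption that feature vectors are hashable (e.g. tuples):

```python
def find_duplicates(samples):
    """Return the set of feature vectors that occur more than once.

    samples: list of (features, label) pairs; features must be hashable,
    which is an assumption of this sketch.
    """
    seen, dups = set(), set()
    for features, _label in samples:
        if features in seen:
            dups.add(features)
        seen.add(features)
    return dups

data = [((1.0, 2.0), 0), ((3.0, 4.0), 1), ((1.0, 2.0), 0)]
print(find_duplicates(data))  # {(1.0, 2.0)}
```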

Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. Then, if the classifier's quality is not acceptable, I gather more data, retrain, and so on.

Here is some research on estimating training set quality.

answered Sep 19 '22 by Kao


This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10X the number of your model parameters. For instance, when training a logistic regression with N features, try to start with 10N training instances.

For an empirical derivation of the "rule of 10", see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
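The empirical approach amounts to plotting a learning curve: train on increasingly large subsets and watch where the held-out error flattens; once it has plateaued, more data of the same kind buys little. A toy sketch, where the Gaussian data and nearest-centroid model stand in for a real problem:

```python
import random

rng = random.Random(0)

def sample(n):
    """Toy labeled data: class 0 ~ N(0, 1), class 1 ~ N(2, 1), balanced."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

def error(train, test):
    """Held-out error of a nearest-centroid classifier fit on `train`."""
    means = {c: sum(x for x, y in train if y == c) /
                max(1, sum(1 for _, y in train if y == c))
             for c in (0, 1)}
    wrong = sum(min(means, key=lambda c: abs(x - means[c])) != y
                for x, y in test)
    return wrong / len(test)

held_out = sample(1000)
errs = []
for n in (10, 50, 250, 1000):       # growing training-set sizes
    e = error(sample(n), held_out)
    errs.append(e)
    print(n, round(e, 3))
```

If the last few points of the curve are essentially flat, the training set is probably large enough for this model; if the error is still falling, collect more data.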

answered Sep 19 '22 by Malay Haldar