For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method to estimate this?
The most common way to decide whether a data set is sufficient is to apply the "10 times" rule: the amount of input data (i.e., the number of examples) should be at least ten times the number of degrees of freedom the model has.
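A minimal sketch of that rule of thumb in Python (min_samples is a hypothetical helper, not something from the original answer):

```python
def min_samples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb lower bound on the number of training examples."""
    return factor * degrees_of_freedom

# E.g., a linear model with 30 coefficients (including the intercept):
print(min_samples(30))  # 300 examples
```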
Your test set should be about 25% of the size of your training set. So for a dataset that is expected to exhibit annual seasonality, the minimum number of daily observations required to train and test a model is 365 + 365/4 ≈ 456.
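A minimal sketch of that arithmetic, assuming a daily series and a chronological (unshuffled) split:

```python
import numpy as np

n_train = 365                # one full year, so annual seasonality is covered
n_test = n_train // 4        # test set is ~25% of the training set
series = np.random.randn(n_train + n_test)  # placeholder data

# Never shuffle a time series: split chronologically.
train, test = series[:n_train], series[n_train:]
print(len(train), len(test), len(train) + len(test))  # 365 91 456
```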
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between training, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, and so on.
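A minimal sketch of that split, assuming scikit-learn and synthetic placeholder data (two chained calls to train_test_split produce the 80/10/10 partition):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary labels.
X, y = np.random.randn(1000, 5), np.random.randint(0, 2, 1000)

# First carve off 20%, then split that 20% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```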
Machine learning models depend on data. Without a foundation of high-quality training data, even the most performant algorithms can be rendered useless. Indeed, robust machine learning models can be crippled when they are trained on inadequate, inaccurate, or irrelevant data in the early stages.
It is not easy to know how many samples you need to collect. However, you can follow these steps to find out whether more data would help:

For solving a typical ML problem:

1. Train your model on progressively larger subsets of the data you already have.
2. Plot the training error and the cross-validation error against the training set size (a learning curve).
3. If the cross-validation error is still falling as the training set grows, collecting more samples is likely to help; if both curves have flattened out close to each other, it is not.

This method will work if your model is not suffering from "high bias". This video from Coursera's Machine Learning course explains it. A minimal sketch of the procedure follows below.
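The sketch assumes scikit-learn's learning_curve utility and its bundled digits dataset (neither is named in the original answer):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Train on growing fractions of the data, scoring each size with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")

# If the validation score is still climbing at the largest size, more data
# should help; if both scores have plateaued close together (high bias),
# more data will not.
```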
Unfortunately, there is no simple method for this.
The rule of thumb is: the bigger, the better. In practice, though, you need to gather a sufficient amount of data, where "sufficient" means covering as large a part of the modeled space as you consider acceptable.
Also, quantity is not everything. The quality of the samples is very important too; for example, the training set should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. If the classifier's quality is not acceptable, I gather more data, and so on, as sketched below.
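A minimal sketch of that loop, assuming scikit-learn; gather_batch is a hypothetical stand-in for whatever collects another batch of labeled data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def gather_batch(n=200):
    # Hypothetical data-collection step; replace with your own source.
    X = np.random.randn(n, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X, y = gather_batch()
target = 0.95

# Train, evaluate, and keep gathering data until quality is acceptable.
while True:
    score = cross_val_score(SVC(), X, y, cv=5).mean()
    if score >= target or len(X) >= 5000:  # cap to avoid an endless loop
        break
    X_new, y_new = gather_batch()
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])

print(f"stopped at n={len(X)} with cross-validated accuracy {score:.3f}")
```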
Here is some research about estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10x the number of your model's parameters. For instance, when training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
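A minimal sketch of that rule applied to logistic regression, assuming scikit-learn and synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_features = 20
n_samples = 10 * n_features        # rule of 10: 10N instances for N features

# Placeholder data sized by the rule of thumb.
X = np.random.randn(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # (200, 20)
```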