For example: if I want to train a classifier (maybe an SVM), how many samples do I need to collect? Is there a method to estimate this?
The most common way to decide whether a data set is sufficient is to apply the "10 times" rule: the amount of input data (i.e., the number of examples) should be at least ten times the number of degrees of freedom the model has.
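A minimal sketch of that rule of thumb in Python (min_samples is a hypothetical helper, not something from the original answer):

```python
def min_samples(degrees_of_freedom: int, factor: int = 10) -> int:
    """Rule-of-thumb lower bound on the number of training examples."""
    return factor * degrees_of_freedom

# E.g., a linear model with 30 coefficients (including the intercept):
print(min_samples(30))  # 300 examples
```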
Your test set should be about 25% of the size of your training set. So for a dataset that is expected to exhibit annual seasonality, the minimum number of daily observations required to train and test a model is 365 + 365/4 ≈ 456.
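A minimal sketch of that arithmetic, assuming a daily series and a chronological (unshuffled) split:

```python
import numpy as np

n_train = 365                # one full year, so annual seasonality is covered
n_test = n_train // 4        # test set is ~25% of the training set
series = np.random.randn(n_train + n_test)  # placeholder data

# Never shuffle a time series: split chronologically.
train, test = series[:n_train], series[n_train:]
print(len(train), len(test), len(train) + len(test))  # 365 91 456
```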
In general, putting 80% of the data in the training set, 10% in the validation set, and 10% in the test set is a good split to start with. The optimal split between training, validation, and test sets depends on factors such as the use case, the structure of the model, the dimensionality of the data, and so on.
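A minimal sketch of that split, assuming scikit-learn and synthetic placeholder data (two chained calls to train_test_split produce the 80/10/10 partition):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder features and binary labels.
X, y = np.random.randn(1000, 5), np.random.randint(0, 2, 1000)

# First carve off 20%, then split that 20% evenly into validation and test.
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.2, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)

print(len(X_train), len(X_val), len(X_test))  # 800 100 100
```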
Machine learning models depend on data. Without a foundation of high-quality training data, even the most performant algorithms can be rendered useless. Indeed, robust machine learning models can be crippled when they are trained on inadequate, inaccurate, or irrelevant data in the early stages.
It is not easy to know how many samples you need to collect. However, you can follow these steps to find out whether more data would help:

For solving a typical ML problem:

1. Train your model on progressively larger subsets of the data you already have.
2. Plot the training error and the cross-validation error against the training set size (a learning curve).
3. If the cross-validation error is still falling as the training set grows, collecting more samples is likely to help; if both curves have flattened out close to each other, it is not.

This method will work if your model is not suffering from "high bias". This video from Coursera's Machine Learning course explains it. A minimal sketch of the procedure follows below.
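The sketch assumes scikit-learn's learning_curve utility and its bundled digits dataset (neither is named in the original answer):

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# Train on growing fractions of the data, scoring each size with 5-fold CV.
sizes, train_scores, val_scores = learning_curve(
    SVC(kernel="rbf"), X, y,
    train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

for n, tr, va in zip(sizes, train_scores.mean(axis=1), val_scores.mean(axis=1)):
    print(f"n={n:4d}  train={tr:.3f}  validation={va:.3f}")

# If the validation score is still climbing at the largest size, more data
# should help; if both scores have plateaued close together (high bias),
# more data will not.
```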
Unfortunately, there is no simple method for this.
The rule of thumb is: the bigger, the better. In practice, though, you need to gather a sufficient amount of data, where "sufficient" means covering as large a part of the modeled space as you consider acceptable.
Also, quantity is not everything. The quality of the samples is very important too; for example, the training set should not contain duplicates.
Personally, when I don't have all possible training data at once, I gather some training data and train a classifier. If the classifier's quality is not acceptable, I gather more data, and so on, as sketched below.
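A minimal sketch of that loop, assuming scikit-learn; gather_batch is a hypothetical stand-in for whatever collects another batch of labeled data:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

def gather_batch(n=200):
    # Hypothetical data-collection step; replace with your own source.
    X = np.random.randn(n, 5)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    return X, y

X, y = gather_batch()
target = 0.95

# Train, evaluate, and keep gathering data until quality is acceptable.
while True:
    score = cross_val_score(SVC(), X, y, cv=5).mean()
    if score >= target or len(X) >= 5000:  # cap to avoid an endless loop
        break
    X_new, y_new = gather_batch()
    X, y = np.vstack([X, X_new]), np.concatenate([y, y_new])

print(f"stopped at n={len(X)} with cross-validated accuracy {score:.3f}")
```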
Here is some research about estimating training set quality.
This depends a lot on the nature of the data and the prediction you are trying to make, but as a simple rule to start with, your training data should be roughly 10x the number of your model's parameters. For instance, when training a logistic regression with N features, try to start with 10N training instances.
For an empirical derivation of the "rule of 10", see https://medium.com/@malay.haldar/how-much-training-data-do-you-need-da8ec091e956
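A minimal sketch of that rule applied to logistic regression, assuming scikit-learn and synthetic placeholder data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

n_features = 20
n_samples = 10 * n_features        # rule of 10: 10N instances for N features

# Placeholder data sized by the rule of thumb.
X = np.random.randn(n_samples, n_features)
y = np.random.randint(0, 2, n_samples)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(X.shape)  # (200, 20)
```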