Does the dataset size influence a machine learning algorithm?

Tags:

So, imagine having access to sufficient data (millions of datapoints for training and testing) of sufficient quality. Please ignore concept drift for now and assume the data static and does not change over time. Does it even make sense to use all of that data in terms of the quality of the model?

Brain and Webb (http://www.csse.monash.edu.au/~webb/Files/BrainWebb99.pdf) have included some results on experimenting with different dataset sizes. Their tested algorithms converge to being somewhat stable after training with 16,000 or 32,000 datapoints. However, since we're living in the big data world we have access to data sets of millions of points, so the paper is somewhat relevant but hugely outdated.

Is there any know more recent research on the impact of dataset sizes on learning algorithms (Naive Bayes, Decision Trees, SVM, neural networks etc).

When does a learning algorithm converge to a certain stable model for which more data does not increase the quality anymore?
Can it happen after 50,000 datapoints, or maybe after 200,000 or only after 1,000,000?
Is there a rule of thumb?
Or maybe there is no way for an algorithm to converge to a stable model, to a certain equilibrium?

Why am I asking this? Imagine a system with limited storage and a huge amount of unique models (thousands of models with their own unique dataset) and no way of increasing the storage. So limiting the size of a dataset is important.

Any thoughts or research on this?

796

asked Sep 04 '14 12:09

user3354890

1 Answers

I did my master's thesis on this subject so I happen to know quite a bit about it.

In a few words in the first part of my master's thesis, I took some really big datasets (~5,000,000 samples) and tested some machine learning algorithms on them by learning on different % of the dataset (learning curves). Results for HIGGS

The hypothesis I made (I was using scikit-learn mostly) was not to optimize the parameters, using the default parameters for the algorithms (I had to make this hypothesis for practical reasons, without optimization some simulations took already more than 24 hours on a cluster).

The first thing to note is that, effectively, every method will lead to a plateau for a certain portion of the dataset. You cannot, however, draw conclusions about the effective number of samples it takes for a plateau to be reached for the following reasons :

Every dataset is different, for really simple datasets they can give you nearly everything they have to offer with 10 samples while some still have something to reveal after 12000 samples (See the Higgs dataset in my example above).
The number of samples in a dataset is arbitrary, in my thesis I tested a dataset with wrong samples that were only added to mess with the algorithms.

We can, however, differentiate two different types of algorithms that will have a different behavior: parametric (Linear, ...) and non-parametric (Random Forest, ...) models. If a plateau is reached with a non-parametric that means the rest of the dataset is "useless". As you can see while the Lightning method reaches a plateau very soon on my picture that doesn't mean that the dataset doesn't have anything left to offer but more than that is the best that the method can do. That's why non-parametric methods work the best when the model to get is complicated and can really benefit from a large number of training samples.

So as for your questions :

See above.
Yes, it all depends on what is inside the dataset.
For me, the only rule of thumb is to go with cross-validation. If you are in the situation in which you think that you will use 20,000 or 30,000 samples you're often in a case where cross-validation is not a problem. In my thesis, I computed the accuracy of my methods on a test set, and when I did not notice a significant improvement I determined the number of samples it took to get there. As I said there are some trends that you can observe (parametric methods tend to saturate more quickly than non-parametric)
Sometimes when the dataset is not large enough you can take every datapoint you have and still have room for improvement if you had a larger dataset. In my thesis with no optimisation on the parameters, the Cifar-10 dataset behaved that way, even after 50,000 none of my algorithm had already converged.

I'd add that optimizing the parameters of the algorithms have a big influence on the speed of convergence to a plateau, but it requires another step of cross-validation.

Your last sentence is highly related to the subject of my thesis, but for me, it was more related to the memory and time available for doing the ML tasks. (As if you cover less than the whole dataset you'll have a smaller memory requirement and it will be faster). About that, the concept of "core-sets" could really be interesting for you.

I hope I could help you, I had to stop because I could on and on about that but if you need more clarifications I'd be happy to help.

answered Sep 20 '22 20:09

AdrienNK

Related questions
                            
                                What is the design & architecture behind facebook's status update mechanism?
                            
                                Fastest way to check if an array is sorted
                            
                                How is it possible to do binary search on a singly-linked list in O(n) time?
                            
                                Lazily Tying the Knot for 1 Dimensional Dynamic Programming
                            
                                Fuzzy string matching in Python
                            
                                Explain why time complexity for summing digits in a number of length N is O(logN)
                            
                                Levenshtein distance based methods Vs Soundex
                            
                                What are the options for generating user friendly alpha numeric IDs (like business id, SKU)
                            
                                The Most Efficient Algorithm to Find First Prefix-Match From a Sorted String Array?
                            
                                Is there a nearest-key map datastructure?
                            
                                Proof that Fowler's money allocation algorithm is correct
                            
                                Map from integer ranges to arbitrary single integers
                            
                                Find the minimum number of elements required so that their sum equals or exceeds S
                            
                                Interview Question... Trying to work it out, but couldn't get an efficient solution
                            
                                Algorithm for finding the ratio of two floating-point numbers?
                            
                                Convert a number from Base B1 to Base B2 without using any intermediate base
                            
                                Count the number of "holes" in a bitmap
                            
                                Generating all possible permutations of a list recursively
                            
                                Reshape MATLAB vector in Row-wise manner
                            
                                Randomly choosing from a list with weighted probabilities

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Does the dataset size influence a machine learning algorithm?

Tags:

algorithm

machine-learning

dataset

svm

bigdata

user3354890

People also ask

1 Answers

AdrienNK

Recent Activity

Donate For Us