So, imagine having access to sufficient data (millions of datapoints for training and testing) of sufficient quality. Please ignore concept drift for now and assume the data is static and does not change over time. Does it even make sense to use all of that data, in terms of the quality of the model?
Brain and Webb (http://www.csse.monash.edu.au/~webb/Files/BrainWebb99.pdf) report some results from experimenting with different dataset sizes. The algorithms they tested become reasonably stable after training on 16,000 to 32,000 datapoints. However, since we are living in the big-data world and have access to datasets of millions of points, the paper is somewhat relevant but hugely outdated.
Is there any known, more recent research on the impact of dataset size on learning algorithms (Naive Bayes, decision trees, SVMs, neural networks, etc.)?
Why am I asking this? Imagine a system with limited storage and a huge number of unique models (thousands of models, each with its own unique dataset) and no way of increasing the storage. So limiting the size of each dataset is important.
Any thoughts or research on this?
The most common way to decide whether a dataset is sufficient is to apply the 10 times rule. This rule says that the amount of input data (i.e., the number of examples) should be at least ten times the number of degrees of freedom the model has.
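To make the rule concrete, here is a rough back-of-the-envelope sketch; the feature count is made up, and "degrees of freedom" is approximated by the number of trainable parameters of a linear model, which is itself only a convention:

```python
# Back-of-the-envelope check of the "10 times" rule for a linear model.
# Assumption: degrees of freedom ~= number of trainable parameters (weights + intercept).
n_features = 100                  # hypothetical number of input features
n_parameters = n_features + 1     # weights plus the intercept
rule_factor = 10                  # the "10 times" heuristic

required_samples = rule_factor * n_parameters
print(f"Heuristic minimum dataset size: {required_samples} examples")  # -> 1010
```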
Big data refers to vast amounts of data that traditional storage methods cannot handle. Machine learning is the ability of computer systems to learn to make predictions from observations and data. Machine learning can use the information provided by the study of big data to generate valuable business insights.
For very small datasets, Bayesian methods are generally the best in class, although the results can be sensitive to your choice of prior.
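To illustrate that prior sensitivity, here is a toy Beta-Binomial sketch (the priors and counts are illustrative assumptions, not results from any of the answers): with only a handful of observations the posterior estimate is dominated by the prior you pick, and the effect fades as the dataset grows.

```python
# Toy Beta-Binomial example: posterior mean of a success probability under
# different priors, showing how much the prior matters when data is scarce.

def posterior_mean(successes, trials, alpha, beta):
    """Posterior mean after updating a Beta(alpha, beta) prior with the observed counts."""
    return (alpha + successes) / (alpha + beta + trials)

priors = {"uniform Beta(1, 1)": (1, 1),
          "optimistic Beta(10, 2)": (10, 2),
          "pessimistic Beta(2, 10)": (2, 10)}

for trials, successes in [(5, 3), (50, 30), (5000, 3000)]:
    print(f"\n{successes}/{trials} successes observed:")
    for name, (a, b) in priors.items():
        print(f"  {name:25s} -> posterior mean {posterior_mean(successes, trials, a, b):.3f}")
```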
The Quicksort algorithm is generally the best choice for large data sets and long keys.
I did my master's thesis on this subject, so I happen to know quite a bit about it.
In short, in the first part of my master's thesis I took some really big datasets (~5,000,000 samples) and tested several machine learning algorithms on them by training on different percentages of the dataset (learning curves).
One assumption I made (I was using scikit-learn mostly) was not to optimize the hyperparameters, but to use the algorithms' default parameters instead (I had to make this assumption for practical reasons; without optimization, some simulations already took more than 24 hours on a cluster).
The first thing to note is that, effectively, every method will reach a plateau after a certain portion of the dataset. You cannot, however, draw general conclusions about the effective number of samples it takes to reach that plateau, mainly because it depends both on the algorithm and on what is inside the dataset.
We can, however, distinguish two types of algorithms with different behavior: parametric models (linear models, ...) and non-parametric models (Random Forest, ...). If a plateau is reached with a non-parametric model, it means the rest of the dataset is essentially "useless": there is nothing more for any model to extract. By contrast, the Lightning (linear) method reaches a plateau very early in my plots; that does not mean the dataset has nothing left to offer, only that this is the best that method can do. That is why non-parametric methods work best when the underlying relationship is complicated and can really benefit from a large number of training samples (a sketch of such a learning-curve comparison follows).
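Here is a rough sketch of that kind of experiment. This is not my actual thesis code: the synthetic dataset, the two models, and the training fractions are placeholders, but scikit-learn's `learning_curve` reproduces the general setup of training on increasing percentages with default hyperparameters.

```python
# Sketch: learning curves for a parametric vs. a non-parametric model,
# both with default hyperparameters, on a stand-in synthetic dataset.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=50_000, n_features=40, n_informative=20,
                           random_state=0)  # placeholder for a large real dataset

models = {"logistic regression (parametric)": LogisticRegression(max_iter=1000),
          "random forest (non-parametric)": RandomForestClassifier(random_state=0)}

for name, model in models.items():
    sizes, _, test_scores = learning_curve(
        model, X, y,
        train_sizes=np.linspace(0.05, 1.0, 8),  # 5% ... 100% of the training folds
        cv=3, scoring="accuracy", n_jobs=-1)
    print(name)
    for n, score in zip(sizes, test_scores.mean(axis=1)):
        print(f"  {n:>6d} samples -> CV accuracy {score:.3f}")
```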
So, as for your questions:
See above.
Yes, it all depends on what is inside the dataset.
For me, the only rule of thumb is to use cross-validation. If you think you will only need 20,000 or 30,000 samples, you are often in a situation where cross-validation is not a problem. In my thesis, I computed the accuracy of my methods on a test set, and when I stopped noticing a significant improvement I recorded the number of samples it took to get there. As I said, there are some trends you can observe (parametric methods tend to saturate more quickly than non-parametric ones); a sketch of that stopping procedure follows.
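A minimal sketch of that "stop when the improvement is no longer significant" idea, assuming you already have `X, y` in memory; the subset sizes and the tolerance are arbitrary choices, not values from the thesis:

```python
# Sketch: grow the training subset until cross-validated accuracy stops improving
# by more than a chosen tolerance.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.utils import shuffle

def samples_needed(X, y, model, sizes, tol=0.005, cv=5):
    """Return the first subset size after which CV accuracy improves by less than tol."""
    X, y = shuffle(X, y, random_state=0)
    previous = -np.inf
    for n in sizes:
        score = cross_val_score(model, X[:n], y[:n], cv=cv, scoring="accuracy").mean()
        print(f"{n:>7d} samples -> CV accuracy {score:.4f}")
        if score - previous < tol:
            return n
        previous = score
    return sizes[-1]  # never plateaued within the sizes tried

# Example usage with placeholder sizes:
# n = samples_needed(X, y, RandomForestClassifier(random_state=0),
#                    sizes=[1_000, 5_000, 10_000, 20_000, 50_000])
```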
Sometimes the dataset is simply not large enough: you can use every datapoint you have and still have room for improvement if you had more data. In my thesis, with no optimization of the parameters, the CIFAR-10 dataset behaved that way; even after 50,000 samples none of my algorithms had converged.
I'd add that optimizing the hyperparameters of the algorithms has a big influence on the speed of convergence to a plateau, but it requires another round of cross-validation, for example a nested search like the one sketched below.
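For instance (an illustrative sketch only; the parameter grid is a guess, not something from my thesis), the tuning can be nested inside the subset loop with scikit-learn's `GridSearchCV`:

```python
# Sketch: tune hyperparameters on each training subset before measuring accuracy,
# so the learning curve reflects a tuned model rather than default parameters.
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

param_grid = {"n_estimators": [100, 300],   # illustrative grid, not exhaustive
              "max_depth": [None, 10, 30]}

def tuned_score(X_subset, y_subset, cv=3):
    """Inner CV picks hyperparameters; outer CV estimates accuracy of the tuned model."""
    search = GridSearchCV(RandomForestClassifier(random_state=0),
                          param_grid, cv=cv, n_jobs=-1)
    return cross_val_score(search, X_subset, y_subset, cv=cv).mean()
```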
Your last point is closely related to the subject of my thesis, although for me it was more about the memory and time available for the ML tasks (if you train on less than the whole dataset, you have a smaller memory requirement and training is faster). In that context, the concept of "core-sets" could be really interesting for you.
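As a pointer, a coreset is a small weighted subset of the data chosen so that training on it approximates training on the full dataset. The sketch below is my own illustration of one published variant, the "lightweight coreset" construction of Bachem, Lucic and Krause for k-means, not something taken from the original answer:

```python
# Sketch of a lightweight coreset for k-means (Bachem, Lucic & Krause, KDD 2018):
# sample each point with probability 1/(2n) + d(x, mean)^2 / (2 * sum of squared distances),
# then weight each sampled point by 1 / (m * q(x)) so weighted sums stay unbiased.
import numpy as np

def lightweight_coreset(X, m, rng=None):
    """Return (points, weights): m weighted samples approximating X for k-means."""
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    dist_sq = ((X - X.mean(axis=0)) ** 2).sum(axis=1)
    q = 0.5 / n + 0.5 * dist_sq / dist_sq.sum()   # sampling distribution over points
    idx = rng.choice(n, size=m, replace=True, p=q)
    weights = 1.0 / (m * q[idx])                  # importance weights
    return X[idx], weights

# Usage idea: cluster the coreset instead of the full data, passing the weights to an
# estimator that supports sample_weight, e.g. KMeans(...).fit(C, sample_weight=w).
```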
I hope this helps. I had to stop somewhere because I could go on and on about this, but if you need more clarification I'd be happy to help.