Does the dataset size influence a machine learning algorithm?

So, imagine having access to sufficient data (millions of datapoints for training and testing) of sufficient quality. Please ignore concept drift for now and assume the data is static and does not change over time. Does it even make sense to use all of that data in terms of the quality of the model?

Brain and Webb (http://www.csse.monash.edu.au/~webb/Files/BrainWebb99.pdf) have published some results on experimenting with different dataset sizes. The algorithms they tested converge to being somewhat stable after training with 16,000 or 32,000 datapoints. However, since we're living in the big-data world, we have access to datasets of millions of points, so the paper is somewhat relevant but hugely outdated.

Is there any known, more recent research on the impact of dataset size on learning algorithms (Naive Bayes, decision trees, SVMs, neural networks, etc.)?

  1. When does a learning algorithm converge to a certain stable model for which more data does not increase the quality anymore?
  2. Can it happen after 50,000 datapoints, or maybe after 200,000 or only after 1,000,000?
  3. Is there a rule of thumb?
  4. Or maybe there is no way for an algorithm to converge to a stable model, to a certain equilibrium?

Why am I asking this? Imagine a system with limited storage and a huge number of unique models (thousands of models, each with its own dataset) and no way of increasing the storage. So limiting the size of a dataset is important.

Any thoughts or research on this?

asked Sep 04 '14 by user3354890


1 Answer

I did my master's thesis on this subject so I happen to know quite a bit about it.

In a few words: in the first part of my master's thesis, I took some really big datasets (~5,000,000 samples) and tested some machine learning algorithms on them by learning on different percentages of the dataset (learning curves). [Figure: learning-curve results on the HIGGS dataset]

The assumption I made (I was mostly using scikit-learn) was not to optimize the hyperparameters and to keep the algorithms' default parameters instead (I had to make this assumption for practical reasons; even without optimization, some simulations already took more than 24 hours on a cluster).
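For readers who want to reproduce this kind of experiment, here is a minimal sketch of the procedure (not my exact thesis code): scikit-learn's `learning_curve` trains an estimator with default hyperparameters on growing fractions of the data and reports the cross-validated score at each size. The synthetic dataset here is only a stand-in for a large set such as HIGGS.

```python
# A minimal learning-curve sketch, assuming synthetic data as a stand-in
# for a large dataset such as HIGGS. Default hyperparameters are used,
# mirroring the no-tuning assumption described above.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import learning_curve

X, y = make_classification(n_samples=50_000, n_features=28, random_state=0)

train_sizes, _, test_scores = learning_curve(
    RandomForestClassifier(random_state=0),   # default hyperparameters
    X, y,
    train_sizes=np.linspace(0.05, 1.0, 10),   # 5% ... 100% of the training split
    cv=3,
    n_jobs=-1,
)

for n, score in zip(train_sizes, test_scores.mean(axis=1)):
    print(f"{int(n):>6d} training samples -> mean CV accuracy {score:.3f}")
```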

The first thing to note is that, effectively, every method will plateau after seeing a certain portion of the dataset. You cannot, however, draw conclusions about the effective number of samples it takes to reach a plateau, for the following reasons:

  • Every dataset is different; really simple datasets can give you nearly everything they have to offer with 10 samples, while others still have something to reveal after 12,000 samples (see the HIGGS dataset in my example above).
  • The number of samples in a dataset is arbitrary; in my thesis I tested a dataset with wrong samples that were only added to mess with the algorithms.

We can, however, differentiate two types of algorithms that behave differently: parametric (linear, ...) and non-parametric (random forest, ...) models. If a plateau is reached with a non-parametric model, that means the rest of the dataset is "useless". As you can see in my picture, the Lightning method reaches a plateau very soon, but that doesn't mean the dataset has nothing left to offer; it just means that this is the best that method can do. That's why non-parametric methods work best when the model to learn is complicated and can really benefit from a large number of training samples.
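To make the contrast concrete, here is a rough sketch (again with synthetic data and default parameters, not my thesis setup) that trains a parametric linear model and a non-parametric random forest on growing subsets of the same data. The exact numbers depend entirely on the dataset, but the linear model will typically stop improving earlier.

```python
# A rough illustration of the parametric vs. non-parametric contrast on
# synthetic data with default settings; the exact numbers depend entirely
# on the dataset, but the linear model typically stops improving earlier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=60_000, n_features=40, n_informative=30,
                           random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=10_000,
                                                    random_state=1)

for n in (500, 2_000, 8_000, 32_000, 50_000):
    for name, model in (("linear (parametric)", LogisticRegression(max_iter=1000)),
                        ("forest (non-parametric)", RandomForestClassifier(random_state=1))):
        model.fit(X_train[:n], y_train[:n])
        acc = accuracy_score(y_test, model.predict(X_test))
        print(f"{name:<25s} n={n:>6d}  test accuracy={acc:.3f}")
```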

So, as for your questions:

  1. See above.

  2. Yes, it all depends on what is inside the dataset.

  3. For me, the only rule of thumb is to go with cross-validation. If you are in a situation where you think you will use 20,000 or 30,000 samples, you're often in a case where cross-validation is not a problem. In my thesis, I computed the accuracy of my methods on a test set, and when I did not notice a significant improvement I determined the number of samples it took to get there (a minimal sketch of that stopping rule is shown after this list). As I said, there are some trends that you can observe (parametric methods tend to saturate more quickly than non-parametric ones).

  4. Sometimes, when the dataset is not large enough, you can take every datapoint you have and still have room for improvement if you had a larger dataset. In my thesis, with no optimisation of the parameters, the CIFAR-10 dataset behaved that way: even after 50,000 samples, none of my algorithms had converged.
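As promised in point 3, here is a minimal sketch of that stopping rule: keep enlarging the training subset and stop once the cross-validated score no longer improves by more than a chosen tolerance. The doubling schedule and the tolerance value are arbitrary assumptions for illustration, not a prescription.

```python
# Hypothetical plateau check: keep doubling the training subset until the
# cross-validated score stops improving by more than `tol`. Both the
# doubling schedule and the threshold are arbitrary illustrative choices.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=100_000, n_features=30, random_state=2)

tol = 0.002            # what counts as a "significant improvement" (assumption)
n, best = 1_000, 0.0
while n <= len(X):
    score = cross_val_score(RandomForestClassifier(random_state=2),
                            X[:n], y[:n], cv=3).mean()
    print(f"n={n:>7d}  mean CV accuracy={score:.4f}")
    if score - best <= tol:
        print(f"Plateau reached around {n:,} samples.")
        break
    best = score
    n *= 2
```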

I'd add that optimizing the parameters of the algorithms has a big influence on the speed of convergence to a plateau, but it requires another step of cross-validation.

Your last sentence is highly related to the subject of my thesis, but for me it was more a matter of the memory and time available for doing the ML tasks (if you cover less than the whole dataset, you'll have a smaller memory requirement and it will be faster). On that note, the concept of "core-sets" could really be interesting for you.
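To connect this to the storage-limited system in the question, here is a toy sketch of the simplest budget-keeping strategy: a class-stratified random subsample capped at a fixed size. It is emphatically not a real core-set construction (which would select and weight points so that the reduced set approximates the full training objective); the function name and budget are just illustrative choices you could start from.

```python
# A toy storage-budget helper: keep only a fixed-size, class-stratified random
# subset of the data. This is NOT a real core-set construction (which would
# select and weight points so the reduced set approximates the full objective);
# it is just the simplest baseline for the limited-storage scenario.
import numpy as np

def stratified_subsample(X, y, budget, seed=0):
    """Return at most `budget` rows of (X, y), roughly preserving class proportions."""
    rng = np.random.default_rng(seed)
    kept = []
    for cls in np.unique(y):
        idx = np.flatnonzero(y == cls)
        n_keep = max(1, int(round(budget * len(idx) / len(y))))
        kept.append(rng.choice(idx, size=min(n_keep, len(idx)), replace=False))
    kept = np.concatenate(kept)
    return X[kept], y[kept]

# Example: cap one model's dataset at 20,000 points before training on it.
X = np.random.default_rng(1).normal(size=(100_000, 10))
y = (X[:, 0] > 0).astype(int)
X_small, y_small = stratified_subsample(X, y, budget=20_000)
print(X_small.shape)   # roughly (20000, 10)
```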

I hope I could help you. I had to stop because I could go on and on about this, but if you need more clarification, I'd be happy to help.

answered Sep 20 '22 by AdrienNK