
sklearn and large datasets

I have a 22 GB dataset that I would like to process on my laptop. Of course, I can't load it into memory.

I use sklearn a lot, but only for much smaller datasets.

In this situation, the classical approach would be something like:

Read only part of the data -> partially train your estimator -> delete the data -> read another part of the data -> continue training your estimator.

I have seen that some sklearn algorithms have a partial_fit method that should allow us to train the estimator on successive subsamples of the data.

Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like

    r = read_part_of_data('data.csv')
    m = sk.my_model
    for i in range(n):
        x = r.read_next_chunk(20 lines)
        m.partial_fit(x)

    m.predict(new_x)

Maybe sklearn is not the right tool for this kind of thing? Let me know.

Donbeo asked May 26 '14 at 15:05





1 Answer

I've used several scikit-learn classifiers with out-of-core capabilities to train linear models (Stochastic Gradient Descent, Perceptron and Passive Aggressive) as well as Multinomial Naive Bayes on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method that you mention, though some behave better than others.

You can find the methodology, the case study and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
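For illustration, here is a minimal sketch of the read-a-chunk, partial_fit, discard loop described in the question. It is not taken from the linked post. It assumes a CSV with a header row, numeric features, and an integer class label in the last column; the helper names (iter_chunks, train_out_of_core, make_demo_csv) are hypothetical, and the small synthetic file stands in for the real 22 GB one.

```python
import csv
import itertools
import os
import tempfile

import numpy as np
from sklearn.linear_model import SGDClassifier


def iter_chunks(path, chunk_size):
    """Yield (X, y) arrays of at most chunk_size rows from a CSV file.

    Assumption: header row, numeric features, label in the last column.
    """
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader)  # skip the header row
        while True:
            rows = list(itertools.islice(reader, chunk_size))
            if not rows:
                break
            data = np.asarray(rows, dtype=float)
            yield data[:, :-1], data[:, -1].astype(int)


def train_out_of_core(path, classes, chunk_size=20):
    """Train a linear classifier one chunk at a time with partial_fit."""
    model = SGDClassifier(random_state=0)
    for X, y in iter_chunks(path, chunk_size):
        # All class labels must be known on the first call, because a
        # single chunk may not contain every class.
        model.partial_fit(X, y, classes=classes)
    return model


def make_demo_csv(path, n_rows=200, seed=0):
    """Write a small synthetic two-class dataset (stand-in for real data)."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, 2))
    y = (X[:, 0] + X[:, 1] > 0).astype(int)
    np.savetxt(path, np.column_stack([X, y]), delimiter=",",
               header="x0,x1,label", comments="")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        csv_path = os.path.join(tmp, "demo.csv")
        make_demo_csv(csv_path)
        model = train_out_of_core(csv_path, classes=np.array([0, 1]))
        print(model.predict(np.array([[2.0, 2.0], [-2.0, -2.0]])))
```

Only one chunk is held in memory at a time, so peak memory depends on chunk_size rather than on the file size. In practice a much larger chunk than 20 rows is usually more efficient, and pandas.read_csv with its chunksize parameter is a common alternative to the hand-written reader. SGD-based models are also sensitive to feature scaling, so real data may need to be standardized chunk by chunk.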

Alexis Perrier answered Oct 03 '22 at 07:10
