I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use sklearn a lot, but only for much smaller datasets.
In this situation the classical approach should be something like:
Read part of the data -> partially train your estimator -> discard that part -> read the next part of the data -> continue training your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow you to train the estimator on successive subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like
```python
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
```
Maybe sklearn is not the right tool for this kind of thing? Let me know.
LogisticRegression as implemented in scikit-learn won't work on such a big dataset: it is a wrapper for liblinear, which requires loading the data into memory before fitting. @ogrisel, LogisticRegression in sklearn uses second-order optimization methods, so it is not well suited to large-scale data.
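If you still want a logistic-regression-style model trained out of core, one common workaround is SGDClassifier with a logistic loss, which does support partial_fit. A minimal sketch, assuming a numeric CSV named data.csv with a binary label in a 'target' column (both names are placeholders), and noting that older scikit-learn releases spell the loss 'log' rather than 'log_loss':

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; supports incremental (out-of-core) fitting.
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # assuming a binary 0/1 label; all classes must be known up front

# Read the CSV in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X = chunk.drop(columns="target").values
    y = chunk["target"].values
    clf.partial_fit(X, y, classes=classes)  # classes= is required on the first call
```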
Both Dask and SFrame can be used with scikit-learn: you can load the 22 GB of data into Dask or SFrame and then feed it to sklearn in chunks.
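As a rough illustration of the Dask side of this (the file name, blocksize, and column layout are assumptions), you can let dask.dataframe split the CSV into partitions and materialise one partition at a time for partial_fit; SFrame offers a similar row-chunk iteration pattern:

```python
import numpy as np
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

df = dd.read_csv("data.csv", blocksize="64MB")  # lazily split the file into partitions
clf = SGDClassifier()
classes = np.array([0, 1])  # assuming a binary 0/1 label

for part in df.to_delayed():   # one delayed pandas DataFrame per partition
    pdf = part.compute()       # only this partition is loaded into memory
    X = pdf.drop(columns="target").values
    y = pdf["target"].values
    clf.partial_fit(X, y, classes=classes)
```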
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method you mention. Some behave better than others, though.
You can find the methodology, the case study and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
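A small sketch of how several of those partial_fit classifiers might be compared on the same stream of chunks (hypothetical file name and column layout; Multinomial Naive Bayes is left out here because it additionally requires non-negative features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier

classes = np.array([0, 1])  # assuming a binary 0/1 label
models = {
    "sgd": SGDClassifier(),
    "perceptron": Perceptron(),
    "passive_aggressive": PassiveAggressiveClassifier(),
}

chunks = pd.read_csv("data.csv", chunksize=100_000)
holdout = next(chunks)  # keep the first chunk aside for a rough evaluation
X_val = holdout.drop(columns="target").values
y_val = holdout["target"].values

# Train every model incrementally on the remaining chunks.
for chunk in chunks:
    X = chunk.drop(columns="target").values
    y = chunk["target"].values
    for model in models.values():
        model.partial_fit(X, y, classes=classes)

# Compare accuracy on the held-out chunk to see which one behaves better.
for name, model in models.items():
    print(name, model.score(X_val, y_val))
```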