I have a dataset of 22 GB. I would like to process it on my laptop. Of course I can't load it in memory.
I use sklearn a lot, but only for much smaller datasets.
In this situation the classical approach should be something like:
Read part of the data -> partially train your estimator -> discard that part -> read the next part of the data -> continue training your estimator.
I have seen that some sklearn algorithms have a partial_fit method that should allow you to train the estimator on successive subsamples of the data.
Now I am wondering: is there an easy way to do that in sklearn? I am looking for something like
```python
r = read_part_of_data('data.csv')
m = sk.my_model
for i in range(n):
    x = r.read_next_chunk(20)  # read the next 20 lines
    m.partial_fit(x)
m.predict(new_x)
```
Maybe sklearn is not the right tool for this kind of thing? Let me know.
LogisticRegression as implemented in scikit-learn won't work on such a big dataset: it is a wrapper for liblinear, which requires loading the data into memory before fitting. @ogrisel, LogisticRegression in sklearn uses second-order optimization methods, so it is not well suited to large-scale data.
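If you still want a logistic-regression-style model trained out of core, one common workaround is SGDClassifier with a logistic loss, which does support partial_fit. A minimal sketch, assuming a numeric CSV named data.csv with a binary label in a 'target' column (both names are placeholders), and noting that older scikit-learn releases spell the loss 'log' rather than 'log_loss':

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier

# Logistic regression trained with SGD; supports incremental (out-of-core) fitting.
clf = SGDClassifier(loss="log_loss")
classes = np.array([0, 1])  # assuming a binary 0/1 label; all classes must be known up front

# Read the CSV in chunks so only one chunk is in memory at a time.
for chunk in pd.read_csv("data.csv", chunksize=100_000):
    X = chunk.drop(columns="target").values
    y = chunk["target"].values
    clf.partial_fit(X, y, classes=classes)  # classes= is required on the first call
```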
Both Dask and SFrame can be used with scikit-learn: you can load the 22 GB of data into Dask or SFrame and then feed it to sklearn in chunks.
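As a rough illustration of the Dask side of this (the file name, blocksize, and column layout are assumptions), you can let dask.dataframe split the CSV into partitions and materialise one partition at a time for partial_fit; SFrame offers a similar row-chunk iteration pattern:

```python
import numpy as np
import dask.dataframe as dd
from sklearn.linear_model import SGDClassifier

df = dd.read_csv("data.csv", blocksize="64MB")  # lazily split the file into partitions
clf = SGDClassifier()
classes = np.array([0, 1])  # assuming a binary 0/1 label

for part in df.to_delayed():   # one delayed pandas DataFrame per partition
    pdf = part.compute()       # only this partition is loaded into memory
    X = pdf.drop(columns="target").values
    y = pdf["target"].values
    clf.partial_fit(X, y, classes=classes)
```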
I've used several scikit-learn classifiers with out-of-core capabilities to train linear models: Stochastic Gradient Descent, Perceptron and Passive Aggressive, and also Multinomial Naive Bayes, on a Kaggle dataset of over 30 GB. All of these classifiers share the partial_fit method you mention. Some behave better than others, though.
You can find the methodology, the case study and some good resources in this post: http://www.opendatascience.com/blog/riding-on-large-data-with-scikit-learn/
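A small sketch of how several of those partial_fit classifiers might be compared on the same stream of chunks (hypothetical file name and column layout; Multinomial Naive Bayes is left out here because it additionally requires non-negative features):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import SGDClassifier, Perceptron, PassiveAggressiveClassifier

classes = np.array([0, 1])  # assuming a binary 0/1 label
models = {
    "sgd": SGDClassifier(),
    "perceptron": Perceptron(),
    "passive_aggressive": PassiveAggressiveClassifier(),
}

chunks = pd.read_csv("data.csv", chunksize=100_000)
holdout = next(chunks)  # keep the first chunk aside for a rough evaluation
X_val = holdout.drop(columns="target").values
y_val = holdout["target"].values

# Train every model incrementally on the remaining chunks.
for chunk in chunks:
    X = chunk.drop(columns="target").values
    y = chunk["target"].values
    for model in models.values():
        model.partial_fit(X, y, classes=classes)

# Compare accuracy on the held-out chunk to see which one behaves better.
for name, model in models.items():
    print(name, model.score(X_val, y_val))
```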