I'm working on a TREC task that involves machine learning techniques, where the dataset consists of more than 5 terabytes of web documents from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionalities that seem to fit my needs, but I don't know whether it is going to scale well to this amount of data. For example, can HashingVectorizer handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
Scikit-learn is steadily evolving, with new models, efficiency improvements in speed and memory, and large-data capabilities. Although scikit-learn is optimized for smaller data, it does offer a decent set of algorithms for out-of-core classification, regression, clustering and decomposition.
HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory, for instance. You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
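A minimal sketch of that loop, assuming the corpus can be streamed as (texts, labels) chunks through a hypothetical iter_minibatches() helper (how you actually read the 5 TB off disk depends on your storage layout):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2 ** 20)  # stateless: no fitting pass over the corpus
    clf = SGDClassifier()                                # any estimator exposing partial_fit
    all_classes = [0, 1]                                 # the full label set, known in advance

    # iter_minibatches() is a hypothetical generator yielding (list_of_texts, labels)
    # chunks of e.g. 10k documents each; implement it against your own storage.
    for texts, labels in iter_minibatches():
        X = vectorizer.transform(texts)                  # sparse bag-of-words for this chunk only
        clf.partial_fit(X, labels, classes=all_classes)  # classes is required on the first call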
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without having to wait until all the samples have been seen.
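Continuing the sketch above, monitoring could look like this, with val_texts / val_labels standing in for the held-out documents:

    # Score a fixed validation set as training progresses,
    # without waiting for the full pass over the corpus.
    X_val = vectorizer.transform(val_texts)              # held-out documents, e.g. 10k of them

    for i, (texts, labels) in enumerate(iter_minibatches()):
        clf.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)
        if i % 10 == 0:                                   # every 10 batches, report progress so far
            print(f"batch {i}: validation accuracy = {clf.score(X_val, val_labels):.3f}")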
You can also do this in parallel on several machines on partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
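As a rough illustration, assuming models is a list of per-partition SGDClassifier instances gathered from the different machines:

    import numpy as np

    def average_linear_models(models):
        """Average coef_ and intercept_ of linear models trained on disjoint partitions."""
        merged = models[0]                                             # reuse one instance as the container
        merged.coef_ = np.mean([m.coef_ for m in models], axis=0)
        merged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
        return merged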
I discuss this in a talk I gave at PyData in March 2013: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial