I'm working on a TREC task that involves machine learning techniques, where the dataset consists of more than 5 terabytes of web documents from which I plan to extract bag-of-words vectors. scikit-learn has a nice set of functionalities that seem to fit my needs, but I don't know whether it is going to scale well to this amount of data. For example, can HashingVectorizer handle 5 terabytes of documents, and is it feasible to parallelize it? Moreover, what are some alternatives out there for large-scale machine learning tasks?
Scikit-learn is steadily evolving, with new models, efficiency improvements in speed and memory, and large-data capabilities. Although scikit-learn is optimized for smaller data, it does offer a decent set of algorithms for out-of-core classification, regression, clustering and decomposition.
HashingVectorizer will work if you iteratively chunk your data into batches of 10k or 100k documents that fit in memory, for instance. You can then pass the batch of transformed documents to a linear classifier that supports the partial_fit method (e.g. SGDClassifier or PassiveAggressiveClassifier) and then iterate on new batches.
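A minimal sketch of that loop, assuming the corpus can be streamed as (texts, labels) chunks through a hypothetical iter_minibatches() helper (how you actually read the 5 TB off disk depends on your storage layout):

    from sklearn.feature_extraction.text import HashingVectorizer
    from sklearn.linear_model import SGDClassifier

    vectorizer = HashingVectorizer(n_features=2 ** 20)  # stateless: no fitting pass over the corpus
    clf = SGDClassifier()                                # any estimator exposing partial_fit
    all_classes = [0, 1]                                 # the full label set, known in advance

    # iter_minibatches() is a hypothetical generator yielding (list_of_texts, labels)
    # chunks of e.g. 10k documents each; implement it against your own storage.
    for texts, labels in iter_minibatches():
        X = vectorizer.transform(texts)                  # sparse bag-of-words for this chunk only
        clf.partial_fit(X, labels, classes=all_classes)  # classes is required on the first call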
You can start scoring the model on a held-out validation set (e.g. 10k documents) as you go, to monitor the accuracy of the partially trained model without having to wait until all the samples have been seen.
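Continuing the sketch above, monitoring could look like this, with val_texts / val_labels standing in for the held-out documents:

    # Score a fixed validation set as training progresses,
    # without waiting for the full pass over the corpus.
    X_val = vectorizer.transform(val_texts)              # held-out documents, e.g. 10k of them

    for i, (texts, labels) in enumerate(iter_minibatches()):
        clf.partial_fit(vectorizer.transform(texts), labels, classes=all_classes)
        if i % 10 == 0:                                   # every 10 batches, report progress so far
            print(f"batch {i}: validation accuracy = {clf.score(X_val, val_labels):.3f}")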
You can also do this in parallel on several machines on partitions of the data, and then average the resulting coef_ and intercept_ attributes to get a final linear model for the whole dataset.
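As a rough illustration, assuming models is a list of per-partition SGDClassifier instances gathered from the different machines:

    import numpy as np

    def average_linear_models(models):
        """Average coef_ and intercept_ of linear models trained on disjoint partitions."""
        merged = models[0]                                             # reuse one instance as the container
        merged.coef_ = np.mean([m.coef_ for m in models], axis=0)
        merged.intercept_ = np.mean([m.intercept_ for m in models], axis=0)
        return merged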
I discuss this in a talk I gave at PyData in March 2013: http://vimeo.com/63269736
There is also sample code in this tutorial on parallelizing scikit-learn with IPython.parallel, taken from: https://github.com/ogrisel/parallel_ml_tutorial