I would like to apply fast online dimensionality reduction techniques such as (online/mini-batch) Dictionary Learning to big text corpora. My input data naturally do not fit in memory (which is why I want to use an online algorithm), so I am looking for an implementation that can iterate over a file rather than loading everything into memory. Is it possible to do this with sklearn? Are there alternatives?
Thanks!
LogisticRegression as implemented in scikit-learn won't work on such a big dataset: it is a wrapper for liblinear that requires loading the data into memory prior to fitting. @ogrisel, LogisticRegression in sklearn uses 2nd-order optimization methods, so it is not well suited to large-scale data.
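The usual workaround is SGDClassifier, which optimizes the same logistic loss with first-order updates and supports partial_fit. A minimal sketch, with a synthetic chunk generator standing in for a real on-disk reader:

import numpy as np
from sklearn.linear_model import SGDClassifier

def iter_chunks(n_chunks=10, chunk_size=100, n_features=20):
    # stand-in for a reader that streams chunks of the dataset from disk
    rng = np.random.RandomState(0)
    for _ in range(n_chunks):
        X = rng.randn(chunk_size, n_features)
        y = (X[:, 0] > 0).astype(int)
        yield X, y

# first-order SGD on the logistic loss; spelled loss="log" in older releases
clf = SGDClassifier(loss="log_loss")
for X_chunk, y_chunk in iter_chunks():
    clf.partial_fit(X_chunk, y_chunk, classes=[0, 1])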
For some algorithms supporting partial_fit, it would be possible to write an outer loop in a script to do out-of-core, large-scale text classification. However, some elements are missing: a dataset reader that iterates over the data on disk as folders of flat files, a SQL database, a NoSQL store, or a Solr index with stored fields, for instance. We also lack an online text vectorizer.
Here is a sample integration template to explain how it would fit together.
import joblib

from sklearn.linear_model import Perceptron

from mymodule import SomeTextDocumentVectorizer
from mymodule import DataSetReader

dataset_reader = DataSetReader('/path/to/raw/data')
expected_classes = dataset_reader.get_all_classes()  # need to know the possible classes ahead of time

feature_extractor = SomeTextDocumentVectorizer()
classifier = Perceptron()

for i, (documents, labels) in enumerate(dataset_reader.iter_chunks()):
    vectors = feature_extractor.transform(documents)
    classifier.partial_fit(vectors, labels, classes=expected_classes)

    if i % 100 == 0:
        # dump model to be able to monitor quality and later analyse convergence externally
        joblib.dump(classifier, 'model_%04d.pkl' % i)
The dataset reader class is application-specific and will probably never make it into scikit-learn (except maybe for a folder of flat text files or CSV files, which would not require adding a new dependency to the library); a minimal version for that flat-files case is sketched below.
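For illustration, here is what such a reader could look like; the class name, the one-folder-per-class / one-document-per-file layout, and the chunk size are assumptions, not scikit-learn API:

import os
import random
from itertools import islice

class DataSetReader(object):
    """Illustrative reader: one sub-folder per class, one document per file."""

    def __init__(self, root, chunk_size=1000):
        self.root = root
        self.chunk_size = chunk_size

    def get_all_classes(self):
        return sorted(os.listdir(self.root))

    def _iter_documents(self):
        paths = [(os.path.join(self.root, label, name), label)
                 for label in self.get_all_classes()
                 for name in os.listdir(os.path.join(self.root, label))]
        random.shuffle(paths)  # mix classes so each chunk is representative
        for path, label in paths:
            with open(path) as f:
                yield f.read(), label

    def iter_chunks(self):
        documents = self._iter_documents()
        while True:
            chunk = list(islice(documents, self.chunk_size))
            if not chunk:
                break
            docs, labels = zip(*chunk)
            yield list(docs), list(labels)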
The text vectorizer part is more problematic. The current vectorizer does not have a partial_fit method because of the way we build the in-memory vocabulary (a Python dict that is trimmed depending on max_df and min_df). We could maybe build one using an external store and drop the max_df and min_df features.
Alternatively, we could build a HashingTextVectorizer that would use the hashing trick to drop the dictionary requirement. Neither of those exists at the moment (although we already have some building blocks, such as a murmurhash wrapper and a pull request for hashing features).
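To make the idea concrete, a minimal sketch of what the hashing trick buys here, using Python's built-in hash in place of a proper MurmurHash (the function name is made up):

import scipy.sparse as sp

def hash_vectorize(documents, n_features=2 ** 20):
    # map token counts into a fixed-width sparse matrix: no vocabulary,
    # hence no state to fit and nothing that grows with the corpus
    rows, cols, values = [], [], []
    for i, doc in enumerate(documents):
        for token in doc.lower().split():
            # Python's hash() is process-seeded; a real implementation
            # would use a stable hash such as MurmurHash
            j = hash(token) % n_features
            rows.append(i)
            cols.append(j)
            values.append(1)
    # duplicate (row, col) pairs are summed by the sparse constructor
    return sp.csr_matrix((values, (rows, cols)),
                         shape=(len(documents), n_features))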
In the meantime, I would advise you to have a look at Vowpal Wabbit and maybe those Python bindings.
Edit: The sklearn.feature_extraction.FeatureHasher class has been merged into the master branch of scikit-learn and will be available in the next release (0.13). Have a look at the documentation on feature extraction.
Edit 2: 0.13 is now released with both FeatureHasher and HashingVectorizer, which can directly deal with text data.
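With HashingVectorizer available, the stateful vectorizer in the template above is no longer needed. A minimal sketch using only real scikit-learn API (the two-chunk document stream is a toy stand-in for data read from disk):

from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)  # stateless: nothing to fit
classifier = SGDClassifier()

# toy stand-in for chunks streamed from disk
chunks = [
    (["good movie", "great film"], [1, 1]),
    (["bad movie", "awful film"], [0, 0]),
]
for documents, labels in chunks:
    vectors = vectorizer.transform(documents)  # no prior fit required
    classifier.partial_fit(vectors, labels, classes=[0, 1])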
Edit 3: there is now an example on out-of-core learning with the Reuters dataset in the official example gallery of the project.
Since scikit-learn 0.13, there is indeed an implementation of HashingVectorizer.
EDIT: Here is a full-fledged example of such an application. Basically, this example demonstrates that you can learn (e.g. classify text) on data that cannot fit in the computer's main memory (but rather resides on disk / network / ...).
In addition to Vowpal Wabbit, gensim might be interesting as well - it too features online Latent Dirichlet Allocation.
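A minimal sketch of gensim's online LDA (the three-document corpus is a toy stand-in for a streamed one):

from gensim.corpora import Dictionary
from gensim.models import LdaModel

texts = [["online", "learning", "text"],
         ["topic", "model", "text"],
         ["online", "topic", "learning"]]
dictionary = Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# chunksize and update_every control the online (mini-batch) updates
lda = LdaModel(corpus=corpus, id2word=dictionary, num_topics=2,
               update_every=1, chunksize=2, passes=1)

# later chunks can be folded in incrementally
lda.update([dictionary.doc2bow(["more", "online", "text"])])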