
Using partial_fit with a scikit-learn Pipeline

How do you call partial_fit() on a scikit-learn classifier wrapped inside a Pipeline()?

I'm trying to build an incrementally trainable text classifier using SGDClassifier like:

from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier

classifier = Pipeline([
    # non_negative was removed in later scikit-learn releases; use alternate_sign=False there
    ('vectorizer', HashingVectorizer(ngram_range=(1, 4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SGDClassifier())),
])

but I get an AttributeError trying to call classifier.partial_fit(x, y).

It supports fit(), so I don't see why partial_fit() isn't available. Would it be possible to introspect the pipeline, call the data transformers, and then directly call partial_fit() on my classifier?

asked Jul 29 '13 by Cerin

People also ask

Does Pipeline support partial_fit?

Pipeline does not use partial_fit, hence does not expose it. We would probably need a dedicated pipelining scheme for out-of-core computation, but that also depends on the capabilities of the previous models.
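
You can check this directly: the Pipeline built in the question exposes fit() but simply has no partial_fit attribute:

hasattr(classifier, 'fit')          # True
hasattr(classifier, 'partial_fit')  # False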

What is partial_fit in scikit-learn?

To perform incremental learning, scikit-learn offers the partial_fit API, which can learn incrementally from batches of instances. partial_fit is useful when the whole dataset is too big to fit in memory at once.
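
As a minimal sketch of what that looks like (the data, batch count, and labels here are made up), each call to partial_fit updates the model with one batch:

import numpy as np
from sklearn.linear_model import SGDClassifier

clf = SGDClassifier()
classes = np.array([0, 1])  # the full label set is required on the first call
for _ in range(10):  # stand-in for batches streamed from disk
    X_batch = np.random.rand(100, 20)
    y_batch = np.random.randint(0, 2, size=100)
    clf.partial_fit(X_batch, y_batch, classes=classes)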

How does a scikit-learn Pipeline work?

The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting the parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.
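
For example, with the pipeline from the question (step names 'vectorizer', 'tfidf', and 'clf'), nested parameters are addressed by joining names with '__'; the values below are purely illustrative:

# the SGDClassifier sits inside OneVsRestClassifier, hence clf__estimator__...
classifier.set_params(vectorizer__ngram_range=(1, 2),
                      clf__estimator__alpha=1e-4)

# the same naming scheme drives grid search over the whole pipeline
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(classifier, {'clf__estimator__alpha': [1e-4, 1e-3]})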

What is the benefit of using the Scikit-learn pipeline utility for data preprocessing?

The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.


1 Answer

Here is what I'm doing, where 'mapper' and 'clf' are the two steps in my Pipeline object.

def partial_pipe_fit(pipeline_obj, df):
    # caution: fit_transform re-fits the mapper on every batch, so a vocabulary-based
    # step like CountVectorizer can yield inconsistent feature columns across batches
    X = pipeline_obj.named_steps['mapper'].fit_transform(df)
    Y = df['class']
    # the first partial_fit call on a classifier must receive the full label set via classes=
    pipeline_obj.named_steps['clf'].partial_fit(X, Y)

You probably want to keep track of performance as you keep adjusting/updating your classifier, but that is a secondary point.

More specifically, the original pipelines were constructed as follows:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn_pandas import DataFrameMapper  # pip install sklearn-pandas

to_vect = Pipeline([('vect', CountVectorizer(min_df=2, max_df=.9, ngram_range=(1, 1),
                                             max_features=100)),
                    ('tfidf', TfidfTransformer())])
full_mapper = DataFrameMapper([
    ('norm_text', to_vect),
    ('norm_fname', to_vect),
])
# n_iter is max_iter in later scikit-learn; self.random_state came from the author's class
full_pipe = Pipeline([('mapper', full_mapper),
                      ('clf', SGDClassifier(n_iter=15, warm_start=True, n_jobs=-1,
                                            random_state=self.random_state))])

Google DataFrameMapper (it comes from the sklearn-pandas package) to learn more about it; here it simply provides a transformation step that plays nicely with pandas DataFrames.
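
For illustration, here is one way the pieces above might be driven batch by batch. The CSV source, chunk size, and label set are assumptions, not part of the original answer; the chunks are assumed to contain the norm_text, norm_fname, and class columns used above:

import numpy as np
import pandas as pd

all_classes = np.array(['ham', 'spam'])  # hypothetical label set
for chunk in pd.read_csv('training_data.csv', chunksize=1000):
    X = full_pipe.named_steps['mapper'].fit_transform(chunk)
    # classes= is required on the first call and harmless on later ones
    full_pipe.named_steps['clf'].partial_fit(X, chunk['class'], classes=all_classes)

Note that because fit_transform re-fits the mapper on every chunk, a stateless HashingVectorizer (as in the question) is a safer choice than CountVectorizer for genuine out-of-core learning.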

answered Sep 19 '22 by meyerson