How do you call partial_fit() on a scikit-learn classifier wrapped inside a Pipeline()?
I'm trying to build an incrementally trainable text classifier using SGDClassifier, like:
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.multiclass import OneVsRestClassifier
classifier = Pipeline([
    ('vectorizer', HashingVectorizer(ngram_range=(1,4), non_negative=True)),
    ('tfidf', TfidfTransformer()),
    ('clf', OneVsRestClassifier(SGDClassifier())),
])
but I get an AttributeError trying to call classifier.partial_fit(x,y).
It supports fit(), so I don't see why partial_fit() isn't available. Would it be possible to introspect the pipeline, call the data transformers, and then directly call partial_fit() on my classifier?
Pipeline does not use partial_fit, hence does not expose it. We would probably need a dedicated pipelining scheme for out-of-core computation, but that also depends on the capabilities of the previous models.
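To make the workaround asked about in the question concrete, here is a minimal sketch, not an official scikit-learn API: it runs the data through every step except the last with transform(), then calls partial_fit() on the final estimator. The helper name pipeline_partial_fit and the toy batches are made up for illustration. It assumes every transformer in the pipeline is stateless (HashingVectorizer is), so the stateful TfidfTransformer from the question is left out, and it uses alternate_sign=False, which replaced non_negative=True in newer scikit-learn releases.

```python
import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vectorizer", HashingVectorizer(ngram_range=(1, 2), alternate_sign=False)),
    ("clf", SGDClassifier(random_state=0)),
])


def pipeline_partial_fit(pipeline, X, y, classes=None):
    """Hypothetical helper: transform X with all steps but the last,
    then call partial_fit on the last step."""
    Xt = X
    for _name, step in pipeline.steps[:-1]:
        # Only safe for stateless transformers that need no fitting.
        Xt = step.transform(Xt)
    pipeline.steps[-1][1].partial_fit(Xt, y, classes=classes)
    return pipeline


# Toy batches, purely illustrative.
batches = [
    (["cheap pills now", "meeting at noon"], ["spam", "ham"]),
    (["win cheap prizes", "lunch meeting tomorrow"], ["spam", "ham"]),
]
all_classes = np.array(["ham", "spam"])  # must list every label up front

for texts, labels in batches:
    pipeline_partial_fit(pipe, texts, labels, classes=all_classes)

predictions = pipe.predict(["cheap prizes"])
```

Passing classes on every call is harmless; it is only required on the first one. If a step needs fitting (such as TfidfTransformer), it would have to be fitted beforehand on representative data, since this sketch never fits the transformers.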
partial_fit: To support incremental learning, scikit-learn provides the partial_fit API, which lets an estimator learn from one batch of instances at a time. partial_fit is useful when the whole dataset is too big to fit in memory at once.
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using the step name and the parameter name separated by '__'.
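A short illustration of the '__' naming convention, using the step names 'vectorizer' and 'clf' from the question; the parameter values are arbitrary and chosen only for the example.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    ("vectorizer", HashingVectorizer()),
    ("clf", SGDClassifier()),
])

# <step name>__<parameter name> addresses a parameter of a single step.
pipe.set_params(vectorizer__ngram_range=(1, 2), clf__alpha=1e-3)
```

The same names work as keys in a parameter grid for GridSearchCV, which is what makes cross-validating the whole pipeline convenient.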
The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.
Here is what I'm doing - where 'mapper' and 'clf' are the 2 steps in my Pipeline obj.
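def partial_pipe_fit(pipeline_obj, df):
    # Caution: fit_transform re-fits the mapper on every batch, so a vocabulary-based
    # vectorizer may produce different columns from one batch to the next.
    X = pipeline_obj.named_steps['mapper'].fit_transform(df)
    Y = df['class']
    # The first partial_fit call on an unfitted SGDClassifier also needs classes=<all labels>.
    pipeline_obj.named_steps['clf'].partial_fit(X,Y)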
You probably want to keep track of performance as you keep updating your classifier, but that is a secondary point.
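One possible way to track performance while updating, sketched here on synthetic numeric data rather than the DataFrame from this answer: score each new batch before training on it ("test, then train"). The data, the batch size, and the name batch_scores are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic, linearly separable data standing in for real batches.
rng = np.random.RandomState(0)
X = rng.randn(600, 5)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

clf = SGDClassifier(random_state=0)
classes = np.array([0, 1])
batch_size = 100
batch_scores = []

for start in range(0, len(X), batch_size):
    X_batch = X[start:start + batch_size]
    y_batch = y[start:start + batch_size]
    if start > 0:
        # Evaluate on the batch while it is still unseen by the model.
        batch_scores.append(clf.score(X_batch, y_batch))
    clf.partial_fit(X_batch, y_batch, classes=classes)
```

Each entry in batch_scores is the accuracy on a batch the model had not yet trained on, so the list gives a rough learning curve over time.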
More specifically, the original pipeline(s) were constructed as follows:
to_vect = Pipeline([
    ('vect', CountVectorizer(min_df=2, max_df=.9, ngram_range=(1, 1), max_features=100)),
    ('tfidf', TfidfTransformer()),
])
full_mapper = DataFrameMapper([
    ('norm_text', to_vect),
    ('norm_fname', to_vect),
])
full_pipe = Pipeline([
    ('mapper', full_mapper),
    ('clf', SGDClassifier(n_iter=15, warm_start=True, n_jobs=-1, random_state=self.random_state)),
])
DataFrameMapper comes from the sklearn-pandas package (google it to learn more); here it just enables a transformation step that plays nicely with pandas.