
Lock steps (prevent refit) in scikit-learn pipeline

Tags:

scikit-learn

Is there a convenient mechanism for locking steps in a scikit-learn pipeline to prevent them from refitting on pipeline.fit()? For example:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups

data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]

pipeline = Pipeline([
    ("vectorizer", CountVectorizer()),
    ("estimator", LinearSVC())
])

# fit the initial step on a subset of the data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)

# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))

The only way I could think of doing this without intermediate transformations would be to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fit transformer. Is this the only way?
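For reference, the wrapper idea described above could look roughly like the sketch below. The class name `FrozenTransformer` and the toy corpus are made up for illustration and are not part of scikit-learn's API; the sketch assumes the pipeline is fit directly, with no caching (`memory=None`) and no cloning.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Hypothetical wrapper around an already fitted transformer."""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None, **fit_params):
        # Deliberately a no-op: the wrapped transformer keeps its fitted state.
        return self

    def transform(self, X):
        return self.transformer.transform(X)

    def fit_transform(self, X, y=None, **fit_params):
        # Pipeline.fit calls fit_transform on intermediate steps,
        # so it is routed to transform as well.
        return self.transform(X)


# Toy corpus (assumption, stands in for the 20 newsgroups data)
docs = ["apple banana", "banana cherry", "cherry date", "date apple"]
labels = [0, 0, 1, 1]

# Fit the vectorizer on a subset only: vocabulary is apple, banana, cherry
vectorizer = CountVectorizer().fit(docs[:2])

pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(vectorizer)),
    ("estimator", LinearSVC()),
])
pipeline.fit(docs, labels)

# The vocabulary is unchanged by pipeline.fit
vocab_size = len(vectorizer.vocabulary_)
print(vocab_size)
```

One caveat with this sketch: `sklearn.base.clone` (used by tools such as `GridSearchCV` and `cross_val_score`) rebuilds nested estimators unfitted, so the wrapped vectorizer would lose its fitted state there. It is only meant for a plain `pipeline.fit(X, y)` call.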

rytido asked Feb 09 '17 14:02



1 Answer

Looking through the code, the Pipeline object does not seem to offer anything like this: calling .fit() on the pipeline calls .fit_transform() (or .fit() followed by .transform()) on each intermediate step and .fit() on the final one.

The best quick-and-dirty solution I could come up with is to monkey-patch away the stage's fitting functionality:

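vectorizer = pipeline.named_steps["vectorizer"]
vectorizer.fit(X[:400])
# disable refitting on the vectorizer step; the lambdas take no `self`
# because attributes set on an instance are not bound as methods
vectorizer.fit = lambda X, y=None, **fit_params: vectorizer
# Pipeline.fit calls fit_transform(X, y) on intermediate steps, so route it to transform
vectorizer.fit_transform = lambda X, y=None, **fit_params: vectorizer.transform(X)

pipeline.fit(X, y)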
Greg Baker answered Jan 03 '23 13:01