Is there a convenient mechanism for locking steps in a scikit-learn pipeline to prevent them from refitting on pipeline.fit()? For example:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]
pipeline = Pipeline([
("vectorizer", CountVectorizer()),
("estimator", LinearSVC())
])
# fit initial step on a subset of the data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)
# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
The only way I could think of doing this without intermediate transformations would be to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fit transformer. Is this the only way?
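For concreteness, here is a minimal sketch of the kind of wrapper I mean (the class name FrozenTransformer and its exact structure are just illustrative, reusing X and y from above):

from sklearn.base import BaseEstimator, TransformerMixin

class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Wrap an already-fitted transformer so that fit() does nothing."""
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # deliberately a no-op: the wrapped transformer is already fitted
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)

# pre-fit the vectorizer, then freeze it inside a pipeline
prefit_vectorizer = CountVectorizer().fit(X[:400])
frozen_pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(prefit_vectorizer)),
    ("estimator", LinearSVC())
])
frozen_pipeline.fit(X, y)  # only the estimator is fit; the vocabulary stays untouched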
Looking through the code, there doesn't seem to be anything in a Pipeline object that supports this: calling .fit() on the pipeline calls .fit() (or .fit_transform()) on every stage.
The best quick-and-dirty solution I could come up with is to monkey-patch away the stage's fitting functionality:
pipeline.named_steps["vectorizer"].fit(X[:400])
# disable .fit() on the vectorizer step
pipeline.named_steps["vectorizer"].fit = lambda self, X, y=None: self
pipeline.named_steps["vectorizer"].fit_transform = model.named_steps["vectorizer"].transform
pipeline.fit(X, y)
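As a quick sanity check (mirroring the prints in the question), the vocabulary size should now stay the same when the whole pipeline is fit again:

print(len(pipeline.named_steps["vectorizer"].vocabulary_))  # vocabulary learned from X[:400]
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))  # unchanged: the vectorizer was not refit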