Is there a convenient mechanism for locking steps in a scikit-learn pipeline to prevent them from refitting on pipeline.fit()? For example:
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
from sklearn.datasets import fetch_20newsgroups
data = fetch_20newsgroups(subset='train')
firsttwoclasses = data.target<=1
y = data.target[firsttwoclasses]
X = np.array(data.data)[firsttwoclasses]
pipeline = Pipeline([
("vectorizer", CountVectorizer()),
("estimator", LinearSVC())
])
# fit initial step on a subset of the data, perhaps an entirely different subset
# this particular example would not be very useful in practice
pipeline.named_steps["vectorizer"].fit(X[:400])
X2 = pipeline.named_steps["vectorizer"].transform(X)
# fit estimator on all data without refitting vectorizer
pipeline.named_steps["estimator"].fit(X2, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
# fitting entire pipeline refits vectorizer
# is there a convenient way to lock the vectorizer without doing the above?
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))
The only way I could think of doing this without intermediate transformations would be to define a custom estimator class (as seen here) whose fit method does nothing and whose transform method is the transform of the pre-fit transformer. Is this the only way?
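For concreteness, here is a minimal sketch of the kind of wrapper I mean (the class name FrozenTransformer and its exact structure are just illustrative, reusing X and y from above):

from sklearn.base import BaseEstimator, TransformerMixin

class FrozenTransformer(BaseEstimator, TransformerMixin):
    """Wrap an already-fitted transformer so that fit() does nothing."""
    def __init__(self, fitted_transformer):
        self.fitted_transformer = fitted_transformer

    def fit(self, X, y=None):
        # deliberately a no-op: the wrapped transformer is already fitted
        return self

    def transform(self, X):
        return self.fitted_transformer.transform(X)

# pre-fit the vectorizer, then freeze it inside a pipeline
prefit_vectorizer = CountVectorizer().fit(X[:400])
frozen_pipeline = Pipeline([
    ("vectorizer", FrozenTransformer(prefit_vectorizer)),
    ("estimator", LinearSVC())
])
frozen_pipeline.fit(X, y)  # only the estimator is fit; the vocabulary stays untouched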
Looking through the code, there doesn't seem to be anything in a Pipeline object that supports this: calling .fit() on the pipeline calls .fit() (or .fit_transform()) on every stage.
The best quick-and-dirty solution I could come up with is to monkey-patch away the stage's fitting functionality:
pipeline.named_steps["vectorizer"].fit(X[:400])
# disable .fit() on the vectorizer step
pipeline.named_steps["vectorizer"].fit = lambda self, X, y=None: self
pipeline.named_steps["vectorizer"].fit_transform = model.named_steps["vectorizer"].transform
pipeline.fit(X, y)
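As a quick sanity check (mirroring the prints in the question), the vocabulary size should now stay the same when the whole pipeline is fit again:

print(len(pipeline.named_steps["vectorizer"].vocabulary_))  # vocabulary learned from X[:400]
pipeline.fit(X, y)
print(len(pipeline.named_steps["vectorizer"].vocabulary_))  # unchanged: the vectorizer was not refit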