Use sklearn's GridSearchCV with a pipeline, preprocessing just once

Tags: python, machine-learning, numpy, scikit-learn, grid-search

I'm using scikit-learn to tune a model's hyperparameters. I'm using a pipeline to chain the preprocessing with the estimator. A simple version of my problem would look like this:

    import numpy as np
    from sklearn.model_selection import GridSearchCV
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler
    from sklearn.linear_model import LogisticRegression

    grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)

    _ = grid.fit(X=np.random.rand(10, 3),
                 y=np.random.randint(2, size=(10,)))

In my case the preprocessing (what would be StandardScaler() in the toy example) is time-consuming, and I'm not tuning any of its parameters.

So, when I run the example, StandardScaler is executed 8 times: 2 (fit/predict) * 2 CV folds * 2 parameter values. But within a given fold, every time StandardScaler is executed for a different value of the parameter C it returns the same output, so it would be much more efficient to compute it once and then run only the estimator part of the pipeline.
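One way to check how often the scaler actually runs is to count its calls. The sketch below assumes a hypothetical CountingScaler subclass (not part of scikit-learn) and fixed toy data; it counts fit and transform separately, so a fit_transform during training adds one to each counter. With two folds and two values of C, the scaler should be fitted once per fold and candidate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts its own calls."""

    # Class-level counter, so it survives the clones GridSearchCV creates.
    calls = {"fit": 0, "transform": 0}

    def fit(self, X, y=None, **kwargs):
        CountingScaler.calls["fit"] += 1
        return super().fit(X, y, **kwargs)

    def transform(self, X, **kwargs):
        CountingScaler.calls["transform"] += 1
        return super().transform(X, **kwargs)


# Toy data with balanced labels so every fold contains both classes.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)  # default n_jobs runs in this process, so the counter is shared

print(CountingScaler.calls)
```

The counter only works this way with the default sequential execution; with n_jobs > 1 the worker processes would each keep their own copy.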

I can manually split the pipeline between the preprocessing (no hyperparameters tuned) and the estimator. But the preprocessing should be fitted on the training set only, so I would have to implement the splits manually and not use GridSearchCV at all.

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?


asked Apr 12 '17 10:04

Marc Garcia

1 Answer

Update: Ideally, the answer below should not be used, as it leads to data leakage, as discussed in the comments. In this answer the StandardScaler is fitted on all the data passed to fit(), so GridSearchCV tunes the hyperparameters on folds whose validation rows have already influenced the scaling, which is not correct. In most cases that should not matter much, but algorithms that are very sensitive to scaling may give misleading results.
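To make the leakage concrete, here is a minimal sketch on made-up random data. It compares the mean a scaler learns from every row with the mean it learns from the training part of one fold. KFold is used here only for illustration; GridSearchCV picks its own splitter.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(10, 3)  # illustrative data, not from the question

# Scaler fitted on every row: what happens when it sits outside the search.
mean_all = StandardScaler().fit(X).mean_

# Scaler fitted on the training rows of the first fold only: what happens
# when the scaler is part of the pipeline being cross-validated.
train_idx, valid_idx = next(KFold(n_splits=2).split(X))
mean_train = StandardScaler().fit(X[train_idx]).mean_

print(mean_all)
print(mean_train)

# The statistics differ because the first one also saw the validation rows.
leaks = not np.allclose(mean_all, mean_train)
print(leaks)
```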

Essentially, GridSearchCV is also an estimator, implementing fit() and predict() methods, used by the pipeline.

So instead of:

    grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)

Do this:

    clf = make_pipeline(StandardScaler(),
                        GridSearchCV(LogisticRegression(),
                                     # the search wraps the bare estimator,
                                     # so the key has no step prefix
                                     param_grid={'C': [0.1, 10.]},
                                     cv=2,
                                     refit=True))

    clf.fit(X, y)    # X, y: your training data
    clf.predict(X)

This calls StandardScaler only once per call to clf.fit(), instead of the multiple calls you described.
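A complete, runnable version of this arrangement might look like the sketch below. The toy data is an assumption for illustration. Since the search now wraps LogisticRegression directly, the grid key is plain C, and the fitted search can be reached through the step name that make_pipeline generates.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Toy data with balanced labels so every fold contains both classes.
rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)

clf = make_pipeline(
    StandardScaler(),
    GridSearchCV(LogisticRegression(),
                 param_grid={'C': [0.1, 10.]},
                 cv=2,
                 refit=True),
)
clf.fit(X, y)  # the scaler is fitted once, on all of X

# make_pipeline names each step after its lowercased class name.
search = clf.named_steps['gridsearchcv']
print(search.best_params_)
print(search.cv_results_['mean_test_score'])

predictions = clf.predict(X)
```

Keep in mind that the scores printed here come from folds scaled with statistics of the full data, which is the leakage the update above warns about.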

Edit:

Changed refit to True, since GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True Refit the best estimator with the entire dataset. If “False”, it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() still runs the search and records the scores, but GridSearchCV does not fit a final estimator afterwards, so the pipeline cannot be used for prediction. When refit=True, GridSearchCV refits the estimator with the best-scoring parameter combination on the whole data passed to fit().
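The refit=False behaviour can be checked with a short sketch on assumed toy data. The scores are still available after fitting, but predicting fails. AttributeError is caught because NotFittedError also derives from it, so this should cover both older and newer scikit-learn versions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)

clf = make_pipeline(
    StandardScaler(),
    GridSearchCV(LogisticRegression(),
                 param_grid={'C': [0.1, 10.]},
                 cv=2,
                 refit=False),
)
clf.fit(X, y)

# The cross-validation scores are recorded even without a refit.
scores = clf.named_steps['gridsearchcv'].cv_results_['mean_test_score']
print(scores)

# Without a refit there is no fitted best estimator to predict with.
try:
    clf.predict(X)
    predict_failed = False
except AttributeError as exc:
    predict_failed = True
    print(type(exc).__name__, exc)
```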

So refit=False is appropriate only if you build the pipeline just to see the scores of the grid search. If you want to call clf.predict(), refit=True must be used; otherwise an error is raised (NotFittedError or AttributeError, depending on the scikit-learn version).
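Given the leakage caveat in the update, a leak-free option worth considering is the memory argument of scikit-learn's pipelines, which caches fitted transformers on disk. The sketch below is an alternative to the approach in this answer, not part of it, and the toy data and temporary directory are assumptions. The scaler stays inside the cross-validated pipeline, so it is fitted on the training rows of each fold only, and the fit should be reused when only C changes.

```python
import tempfile
from shutil import rmtree

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(20, 3)
y = np.tile([0, 1], 10)

# Temporary directory holding the cached transformer fits.
cachedir = tempfile.mkdtemp()
try:
    pipe = make_pipeline(StandardScaler(), LogisticRegression(),
                         memory=cachedir)
    grid = GridSearchCV(pipe,
                        param_grid={'logisticregression__C': [0.1, 10.]},
                        cv=2,
                        refit=False)
    grid.fit(X, y)
    scores = grid.cv_results_['mean_test_score']
    print(scores)
finally:
    rmtree(cachedir)  # remove the cache once the search is done
```

The cache only covers fitting on the training rows; transforming the held-out rows for scoring still runs for every candidate, so the saving is largest when fitting is the expensive part.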


answered Sep 23 '22 04:09

Vivek Kumar
