In the example below,
pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2)
grid.fit(X_train, y_train)
print(grid.score(X_test, y_test))
I am using StandardScaler(); is this the correct way to apply it to the test set as well?
StandardScaler standardizes a feature by subtracting the mean and then scaling to unit variance. Unit variance means dividing all the values by the standard deviation. StandardScaler is a common preprocessing step applied before many machine learning models, in order to standardize the range of the input features.
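As a minimal sketch of that formula (using a tiny made-up matrix X_toy, just for illustration), the transform is simply (x - mean) / std per column:

import numpy as np
from sklearn.preprocessing import StandardScaler

# toy feature matrix (hypothetical data, for illustration only)
X_toy = np.array([[1.0, 10.0],
                  [2.0, 20.0],
                  [3.0, 30.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# manual equivalent: subtract the column mean, divide by the column std
X_manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(X_scaled, X_manual))  # True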
Yes, this is the right way to do it, but there is a small mistake in your code. Let me break this down for you. When you use the StandardScaler as a step inside a Pipeline, scikit-learn will internally do the job for you.
What happens can be described as follows:

Step 0: The data are split into TRAINING data and TEST data according to the cv parameter that you specified in the GridSearchCV.
Step 1: The scaler is fitted on the TRAINING data.
Step 2: The scaler transforms the TRAINING data.
Step 3: The models are fitted/trained using the transformed TRAINING data.
Step 4: The scaler is used to transform the TEST data.
Step 5: The trained models predict using the transformed TEST data.
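For intuition, here is a rough manual sketch of what those steps amount to inside a single cross-validation fold (the names X_train_fold, y_train_fold, X_test_fold are hypothetical; the Pipeline does all of this for you):

from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.svm import SVC

# Steps 1 + 2: fit the scaler on the TRAINING fold only, then transform it
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_fold)  # X_train_fold: assumed training split

# Step 3: fit the remaining steps (PCA + SVC) on the transformed TRAINING fold
pca = PCA(n_components=4)
X_train_reduced = pca.fit_transform(X_train_scaled)
clf = SVC(kernel='linear', C=1).fit(X_train_reduced, y_train_fold)

# Step 4: reuse the ALREADY FITTED scaler (and PCA) on the TEST fold; no refitting
X_test_scaled = scaler.transform(X_test_fold)  # X_test_fold: assumed test split
X_test_reduced = pca.transform(X_test_scaled)

# Step 5: predict on the transformed TEST fold
y_pred = clf.predict(X_test_reduced)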
Note: You should be using grid.fit(X, y) and NOT grid.fit(X_train, y_train), because GridSearchCV will automatically split the data into training and testing data (this happens internally).
Use something like this:
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.decomposition import PCA

pipe = Pipeline([
    ('scale', StandardScaler()),
    ('reduce_dims', PCA(n_components=4)),
    ('clf', SVC(kernel='linear', C=1))])

param_grid = dict(reduce_dims__n_components=[4, 6, 8],
                  clf__C=np.logspace(-4, 1, 6),
                  clf__kernel=['rbf', 'linear'])

grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X, y)

print(grid.best_score_)
print(grid.cv_results_)
Once you run this code (when you call grid.fit(X, y)), you can access the outcome of the grid search through the fitted grid object. The best_score_ attribute provides the best score observed during the optimization procedure, and best_params_ describes the combination of parameters that achieved the best results.
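For example, after fitting you could inspect the winning configuration like this (a small sketch; grid is the fitted GridSearchCV object from above, with the default refit=True):

print(grid.best_score_)      # best mean cross-validated accuracy
print(grid.best_params_)     # e.g. {'clf__C': ..., 'clf__kernel': ..., 'reduce_dims__n_components': ...}
print(grid.best_estimator_)  # the pipeline refitted on all of X, y with the best parameters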
IMPORTANT EDIT 1: if you want to keep a hold-out validation set from the original dataset, use this:

from sklearn.model_selection import train_test_split

X_for_gridsearch, X_future_validation, y_for_gridsearch, y_future_validation = train_test_split(
    X, y, test_size=0.15, random_state=1)
Then use:
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3, n_jobs=1, verbose=2, scoring='accuracy')
grid.fit(X_for_gridsearch, y_for_gridsearch)
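Once the search finishes, you can then estimate generalization performance on the held-out split, which the grid search never saw (a one-line sketch, following the variable names above):

# evaluate the best refitted pipeline on the untouched hold-out data
print(grid.score(X_future_validation, y_future_validation))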
Quick answer: Your methodology is correct.
Although the above answer is very good, I would just like to point out some subtleties:

best_score_ [1] is the best cross-validation metric, not the generalization performance of the model [2]. To evaluate how well the best-found parameters generalize, you should call score on the test set, as you have done. Therefore you need to start by splitting the data into training and test sets, fit the grid search only on X_train, y_train, and then score it with X_test, y_test [2].
Deep Dive:
A threefold split of the data into training set, validation set and test set is one way to prevent overfitting the parameters during grid search. Alternatively, GridSearchCV uses cross-validation on the training set instead of separate training and validation sets, but this does not replace the test set. This can be verified in [2] and [3].
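Concretely, the pattern described in [2] looks something like this (a minimal sketch, reusing pipe and param_grid from the first answer; random_state=0 is an arbitrary choice):

from sklearn.model_selection import train_test_split, GridSearchCV

# split once into training and test sets; the test set is reserved for the final estimate
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# cross-validation inside GridSearchCV stands in for a separate validation set
grid = GridSearchCV(pipe, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)

print(grid.best_score_)            # best cross-validation score (model selection)
print(grid.score(X_test, y_test))  # generalization estimate on the untouched test set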
References:
[1] GridSearchCV (scikit-learn documentation)
[2] Müller & Guido, Introduction to Machine Learning with Python
[3] Cross-validation: evaluating estimator performance (scikit-learn user guide, section 3.1)