How to use scikit's preprocessing/normalization along with cross validation?

Question

As an example of cross-validation without any preprocessing, I can do something like this:

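    from sklearn.linear_model import SGDClassifier
    # GridSearchCV lives in sklearn.model_selection; sklearn.grid_search was removed in scikit-learn 0.20
    from sklearn.model_selection import GridSearchCV

    tuned_params = [{"penalty": ["l2", "l1"]}]
    sgd = SGDClassifier()
    # Pass the estimator and parameter grid that were actually defined above
    clf = GridSearchCV(sgd, tuned_params, verbose=5)
    clf.fit(x_train, y_train)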

I would like to preprocess my data using something like:

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together, and statistics from the held-out fold would leak into training. How do I set up the cross-validation to preprocess the corresponding training and test sets separately on each run?

Sean Easter · Accepted Answer

Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]

More relevant information on pipelines is available in the scikit-learn documentation for Pipeline.
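As a minimal sketch of what that might look like for the example in the question: the scaler and the classifier are chained in a Pipeline, and the pipeline, rather than the bare classifier, is handed to GridSearchCV. The step names "scale" and "sgd", the synthetic data from make_classification, and the choice of StandardScaler (the estimator counterpart of preprocessing.scale) are assumptions made for illustration, not part of the original question.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the question's x_train / y_train (assumption).
x_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# Chain scaling and classification. During cross-validation the scaler is
# refit on the training portion of each split and only applied to the
# held-out portion.
pipe = Pipeline([
    ("scale", StandardScaler()),             # step name is illustrative
    ("sgd", SGDClassifier(random_state=0)),  # step name is illustrative
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

best_penalty = clf.best_params_["sgd__penalty"]
```

After fitting, clf.best_estimator_ is the whole pipeline refit on all of x_train, so calling clf.predict on new data applies the learned scaling before classification.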
