 

How to use scikit's preprocessing/normalization along with cross validation?

As an example of cross-validation without any preprocessing, I can do something like this:

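    from sklearn.linear_model import SGDClassifier
    # GridSearchCV lives in sklearn.model_selection (sklearn.grid_search was removed)
    from sklearn.model_selection import GridSearchCV

    # Hyperparameter grid to search over
    tuned_params = [{"penalty": ["l2", "l1"]}]
    sgd = SGDClassifier()
    # Pass the estimator and grid defined above
    clf = GridSearchCV(sgd, tuned_params, verbose=5)
    clf.fit(x_train, y_train)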

I would like to preprocess my data using something like

    from sklearn import preprocessing
    x_scaled = preprocessing.scale(x_train)

But it would not be a good idea to do this before setting up the cross-validation, because then the training and test sets would be normalized together. How do I set up the cross-validation so that the corresponding training and test sets are preprocessed separately on each run?

Fequish asked Sep 16 '15 01:09



1 Answer

Per the documentation, if you employ Pipeline, this can be done for you. From the docs, just above section 3.1.1.1, emphasis mine:

Just as it is important to test a predictor on data held-out from training, preprocessing (such as standardization, feature selection, etc.) and similar data transformations similarly should be learnt from a training set and applied to held-out data for prediction [...] A Pipeline makes it easier to compose estimators, providing this behavior under cross-validation[.]

More relevant information on pipelines is available in the scikit-learn documentation.
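As a rough sketch of how this might look for the setup in the question: the scaler and the classifier are chained in a Pipeline, and the pipeline is handed to GridSearchCV in place of the bare classifier. Several things here are assumptions rather than anything from the question: `make_classification` stands in for the real `x_train` / `y_train`, `StandardScaler` is used as the estimator counterpart of `preprocessing.scale`, and the step names `"scale"` and `"sgd"` as well as `cv=5` are arbitrary choices.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for x_train / y_train (assumption: the real data is a
# numeric feature matrix with class labels).
x_train, y_train = make_classification(
    n_samples=200, n_features=10, random_state=0
)

# Chain the scaler and the classifier. During cross-validation the whole
# pipeline is re-fit on the training part of each split, so the scaler's
# mean and variance never see the held-out part.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("sgd", SGDClassifier(random_state=0)),
])

# Parameters of a pipeline step are addressed as <step name>__<parameter>.
tuned_params = [{"sgd__penalty": ["l2", "l1"]}]

clf = GridSearchCV(pipe, tuned_params, cv=5)
clf.fit(x_train, y_train)

print(clf.best_params_)
print(clf.best_score_)
```

After fitting, `clf.predict(x_test)` applies the scaling learnt from the training data before classifying, so no separate call to the scaler should be needed for new data.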

Sean Easter answered Nov 07 '22 10:11
