Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Are the k-fold cross-validation scores from scikit-learn's `cross_val_score` and `GridsearchCV` biased if we include transformers in the pipeline?

Data pre-processers such as StandardScaler should be used to fit_transform the train set and only transform (not fit) the test set. I expect the same fit/transform process applies to cross-validation for tuning the model. However, I found cross_val_score and GridSearchCV fit_transform the entire train set with the preprocessor (rather than fit_transform the inner_train set, and transform the inner_validation set). I believe this artificially removes the variance from the inner_validation set which makes the cv score (the metric used to select the best model by GridSearch) biased. Is this a concern or did I actually miss anything?

To demonstrate the above issue, I tried the following three simple test cases with the Breast Cancer Wisconsin (Diagnostic) Data Set from Kaggle.

  1. I intentionally fit and transform the entire X with StandardScaler()
X_sc = StandardScaler().fit_transform(X)
lr = LogisticRegression(penalty='l2', random_state=42)
cross_val_score(lr, X_sc, y, cv=5)
  1. I include SC and LR in the Pipeline and run cross_val_score
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(penalty='l2', random_state=42))
])
cross_val_score(pipe, X, y, cv=5)
  1. Same as 2 but with GridSearchCV
pipe = Pipeline([
    ('sc', StandardScaler()),
    ('lr', LogisticRegression(random_state=42))
])
params = {
    'lr__penalty': ['l2']
}
gs=GridSearchCV(pipe,
param_grid=params, cv=5).fit(X, y)
gs.cv_results_

They all produce the same validation scores. [0.9826087 , 0.97391304, 0.97345133, 0.97345133, 0.99115044]

like image 861
Kai Zhao Avatar asked Aug 26 '19 03:08

Kai Zhao


1 Answers

No, sklearn doesn't do fit_transform with entire dataset.

To check this, I subclassed StandardScaler to print the size of the dataset sent to it.

class StScaler(StandardScaler):
    def fit_transform(self,X,y=None):
        print(len(X))
        return super().fit_transform(X,y)

If you now replace StandardScaler in your code, you'll see dataset size passed in first case is actually bigger.

But why does the accuracy remain exactly same? I think this is because LogisticRegression is not very sensitive to feature scale. If we instead use a classifier that is very sensitive to scale, like KNeighborsClassifier for example, you'll find accuracy between two cases start to vary.

X,y = load_breast_cancer(return_X_y=True)
X_sc = StScaler().fit_transform(X)
lr = KNeighborsClassifier(n_neighbors=1)
cross_val_score(lr, X_sc,y, cv=5)

Outputs:

569
[0.94782609 0.96521739 0.97345133 0.92920354 0.9380531 ]

And the 2nd case,

pipe = Pipeline([
    ('sc', StScaler()),
    ('lr', KNeighborsClassifier(n_neighbors=1))
])
print(cross_val_score(pipe, X, y, cv=5))

Outputs:

454
454
456
456
456
[0.95652174 0.97391304 0.97345133 0.92920354 0.9380531 ]

Not big change accuracy-wise, but change nonetheless.

like image 96
Shihab Shahriar Khan Avatar answered Sep 22 '22 10:09

Shihab Shahriar Khan