
Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.

Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.

The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.

I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).

1) Is this the right way to think about cross-validation?

2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.

There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?

Any guidance here would be great!

asked Jan 28 '18 by anon_swe




2 Answers

1) Yes, you should impute the 20% test data using the 80% training data.
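For example, here is a minimal sketch of that idea on a single train/validation split (assuming a purely numeric array and mean imputation; in recent scikit-learn versions the imputer lives in sklearn.impute as SimpleImputer):

import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical 80% training fold and 20% held-out fold
X_train_fold = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_val_fold = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_fold)                      # statistics come from the training fold only
X_train_imputed = imputer.transform(X_train_fold)
X_val_imputed = imputer.transform(X_val_fold)  # held-out fold is filled with the training-fold mean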

2) I wrote a blog post that answers your second question, but I'll include the core parts here.

With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute numeric features with their median before scaling them. The categorical and boolean features are imputed with the mode, and the categorical features are then one-hot encoded.

You can include an estimator at the end of the pipeline for regression, classification, etc.

import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, Imputer, StandardScaler

# Note: Imputer was deprecated in scikit-learn 0.20 and later removed;
# in recent versions use sklearn.impute.SimpleImputer instead.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        # numeric columns: impute the median, then standardize
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        # categorical columns: impute the mode, then one-hot encode
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        # boolean columns: impute the mode
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)

The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame; TypeSelector.transform selects the subset of columns with the given data type.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class TypeSelector(BaseEstimator, TransformerMixin):
    """Select the DataFrame columns of a given dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # stateless transformer: nothing to learn
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
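
To tie this back to the leakage question: append an estimator to the preprocessing pipeline and pass the whole thing to cross_val_score, and the imputation, scaling, and encoding are re-fit on each training fold only. A rough sketch, assuming a pandas DataFrame X with the dtypes above and a target y (the classifier choice is only illustrative; on scikit-learn 0.20+ you would swap Imputer for sklearn.impute.SimpleImputer, or replace TypeSelector/FeatureUnion with ColumnTransformer):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# the whole pipeline is cloned and re-fit on each training fold, so the
# held-out fold never influences the preprocessing statistics
model = make_pipeline(preprocess_pipeline, LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())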
answered by ramhiser


I recommend thinking of 5-fold cross-validation as simply splitting the data into 5 parts (folds). You hold out one fold for testing and use the other 4 together as your training set. This process is repeated another 4 times, so that each fold gets a chance to serve as the test set.
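
For illustration, a tiny sketch of how KFold generates those splits (a hypothetical 10-sample array; each sample lands in exactly one test fold):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 dummy samples
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print("train:", train_idx, "test:", test_idx)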

For your imputation to work correctly and not be subject to contamination, you would need to determine the mean from the 4 folds used for training, and use it to impute missing values in both the training folds and the held-out test fold.

I like to implement the CV split with StratifiedKFold. This ensures each fold preserves the class proportions of the full dataset.
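
A rough sketch of that manual approach, assuming a numeric NumPy feature matrix X with missing values, a label vector y, and mean imputation (names are illustrative):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    # compute the column means from the training folds only
    col_means = np.nanmean(X_train, axis=0)
    # fill missing values in both splits with the training-fold means
    X_train = np.where(np.isnan(X_train), col_means, X_train)
    X_test = np.where(np.isnan(X_test), col_means, X_test)
    # ... fit the model on X_train / y[train_idx], evaluate on X_test / y[test_idx] ...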

To answer your question about using Pipelines: yes, I would subclass BaseEstimator (together with TransformerMixin) to build a custom imputation transformer. Inside your loop over the CV splits, compute the mean from the training folds, set it as a parameter on your transformer, and then call fit or transform.
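
A minimal sketch of such a transformer (illustrative name; it learns the column means in fit, so if you drop it into a Pipeline and hand that to cross_val_score, it is re-fit on each training fold automatically and you can skip the manual loop):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """Fill missing values with column means learned from the training data."""

    def fit(self, X, y=None):
        # learn the fill values from whatever data fit() receives (the training fold)
        self.means_ = X.mean(numeric_only=True)
        return self

    def transform(self, X):
        # apply the learned means; columns without a learned mean are left untouched
        return X.fillna(self.means_)

Used as make_pipeline(MeanImputer(), some_estimator) and passed to cross_val_score, the means are recomputed from the training folds on every split, which is exactly what avoids the leakage described above.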

answered by Robert F. Dickerson