
Scikit-Learn: Avoiding Data Leakage During Cross-Validation

I've just been reading up on k-fold cross-validation and have realized that I'm inadvertently leaking data with my current preprocessing setup.

Usually, I have a train and test dataset. I do a bunch of data imputation and one-hot encoding on my entire train dataset and then run k-fold cross-validation.

The leakage comes in because, if I'm doing 5-fold cross-validation, I'm training on 80% of my train data and testing it on the remaining 20% of the train data.

I really should just be imputing the 20% based on the 80% of train (whereas I was using 100% of the data before).

1) Is this the right way to think about cross-validation?

2) I've been looking at the Pipeline class in sklearn.pipeline and it seems useful for doing a bunch of transformations and then finally fitting a model to the resulting data. However, I'm doing a bunch of stuff like "impute missing data in float64 columns with the mean", "impute all other data with the mode", etc.

There isn't an obvious transformer for this kind of imputation. How would I go about adding this step to a Pipeline? Would I just make my own subclass of BaseEstimator?

Any guidance here would be great!

asked Jan 28 '18 by anon_swe




2 Answers

1) Yes, you should impute the 20% test data using the 80% training data.
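For example, here is a minimal sketch of that idea on a single train/validation split (assuming a purely numeric array and mean imputation; in recent scikit-learn versions the imputer lives in sklearn.impute as SimpleImputer):

import numpy as np
from sklearn.impute import SimpleImputer

# hypothetical 80% training fold and 20% held-out fold
X_train_fold = np.array([[1.0], [2.0], [np.nan], [4.0]])
X_val_fold = np.array([[np.nan], [3.0]])

imputer = SimpleImputer(strategy="mean")
imputer.fit(X_train_fold)                      # statistics come from the training fold only
X_train_imputed = imputer.transform(X_train_fold)
X_val_imputed = imputer.transform(X_val_fold)  # held-out fold is filled with the training-fold mean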

2) I wrote a blog post that answers your second question, but I'll include the core parts here.

With sklearn.pipeline, you can apply separate preprocessing rules to different feature types (e.g., numeric, categorical). In the example code below, I impute numeric features with their median before scaling them. The categorical and boolean features are imputed with the mode, and the categorical features are then one-hot encoded.

You can include an estimator at the end of the pipeline for regression, classification, etc.

import numpy as np
from sklearn.pipeline import make_pipeline, FeatureUnion
from sklearn.preprocessing import OneHotEncoder, Imputer, StandardScaler

# Note: Imputer was deprecated in scikit-learn 0.20 and later removed;
# in recent versions use sklearn.impute.SimpleImputer instead.
preprocess_pipeline = make_pipeline(
    FeatureUnion(transformer_list=[
        # numeric columns: impute the median, then standardize
        ("numeric_features", make_pipeline(
            TypeSelector(np.number),
            Imputer(strategy="median"),
            StandardScaler()
        )),
        # categorical columns: impute the mode, then one-hot encode
        ("categorical_features", make_pipeline(
            TypeSelector("category"),
            Imputer(strategy="most_frequent"),
            OneHotEncoder()
        )),
        # boolean columns: impute the mode
        ("boolean_features", make_pipeline(
            TypeSelector("bool"),
            Imputer(strategy="most_frequent")
        ))
    ])
)

The TypeSelector portion of the pipeline assumes the object X is a pandas DataFrame; TypeSelector.transform selects the subset of columns with the given data type.

from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

class TypeSelector(BaseEstimator, TransformerMixin):
    """Select the DataFrame columns of a given dtype."""
    def __init__(self, dtype):
        self.dtype = dtype

    def fit(self, X, y=None):
        # stateless transformer: nothing to learn
        return self

    def transform(self, X):
        assert isinstance(X, pd.DataFrame)
        return X.select_dtypes(include=[self.dtype])
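
To tie this back to the leakage question: append an estimator to the preprocessing pipeline and pass the whole thing to cross_val_score, and the imputation, scaling, and encoding are re-fit on each training fold only. A rough sketch, assuming a pandas DataFrame X with the dtypes above and a target y (the classifier choice is only illustrative; on scikit-learn 0.20+ you would swap Imputer for sklearn.impute.SimpleImputer, or replace TypeSelector/FeatureUnion with ColumnTransformer):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# the whole pipeline is cloned and re-fit on each training fold, so the
# held-out fold never influences the preprocessing statistics
model = make_pipeline(preprocess_pipeline, LogisticRegression())
scores = cross_val_score(model, X, y, cv=5)
print(scores.mean())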
answered by ramhiser


I recommend thinking of 5-fold cross-validation as simply splitting the data into 5 parts (folds). You hold out one fold for testing and use the other 4 together as your training set. This process is repeated another 4 times, so that each fold gets a chance to serve as the test set.
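
For illustration, a tiny sketch of how KFold generates those splits (a hypothetical 10-sample array; each sample lands in exactly one test fold):

import numpy as np
from sklearn.model_selection import KFold

X = np.arange(10).reshape(-1, 1)  # 10 dummy samples
for train_idx, test_idx in KFold(n_splits=5).split(X):
    print("train:", train_idx, "test:", test_idx)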

For your imputation to work correctly and not be subject to contamination, you would need to determine the mean from the 4 folds used for training, and use it to impute missing values in both the training folds and the held-out test fold.

I like to implement the CV split with StratifiedKFold. This ensures each fold preserves the class proportions of the full dataset.
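
A rough sketch of that manual approach, assuming a numeric NumPy feature matrix X with missing values, a label vector y, and mean imputation (names are illustrative):

import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for train_idx, test_idx in skf.split(X, y):
    X_train, X_test = X[train_idx], X[test_idx]
    # compute the column means from the training folds only
    col_means = np.nanmean(X_train, axis=0)
    # fill missing values in both splits with the training-fold means
    X_train = np.where(np.isnan(X_train), col_means, X_train)
    X_test = np.where(np.isnan(X_test), col_means, X_test)
    # ... fit the model on X_train / y[train_idx], evaluate on X_test / y[test_idx] ...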

To answer your question about using Pipelines: yes, I would subclass BaseEstimator (together with TransformerMixin) to build a custom imputation transformer. Inside your loop over the CV splits, compute the mean from the training folds, set it as a parameter on your transformer, and then call fit or transform.
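
A minimal sketch of such a transformer (illustrative name; it learns the column means in fit, so if you drop it into a Pipeline and hand that to cross_val_score, it is re-fit on each training fold automatically and you can skip the manual loop):

import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class MeanImputer(BaseEstimator, TransformerMixin):
    """Fill missing values with column means learned from the training data."""

    def fit(self, X, y=None):
        # learn the fill values from whatever data fit() receives (the training fold)
        self.means_ = X.mean(numeric_only=True)
        return self

    def transform(self, X):
        # apply the learned means; columns without a learned mean are left untouched
        return X.fillna(self.means_)

Used as make_pipeline(MeanImputer(), some_estimator) and passed to cross_val_score, the means are recomputed from the training folds on every split, which is exactly what avoids the leakage described above.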

answered by Robert F. Dickerson