Implementing KNN imputation on categorical variables in an sklearn pipeline

I am implementing a pre-processing pipeline using sklearn's pipeline transformers. My pipeline includes sklearn's KNNImputer estimator, which I want to use to impute categorical features in my dataset. (My question is similar to this thread, but it doesn't contain the answer to my question: How to implement KNN to impute categorical features in a sklearn pipeline)

I know that the categorical features have to be encoded before imputation, and this is where I am having trouble. With the standard label/ordinal/one-hot encoders, trying to encode categorical features that contain missing values (np.nan) raises the following error:

ValueError: Input contains NaN
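
For reference, a minimal reproduction of the error (the column name is made up for illustration; the behaviour can vary across scikit-learn versions, and recent OrdinalEncoder releases can encode NaN natively):

import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'colour': ['red', np.nan, 'blue']})
OrdinalEncoder().fit_transform(X)  # raises ValueError: Input contains NaN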

I've managed to bypass that by creating a custom encoder that replaces np.nan with 'Missing':

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import OrdinalEncoder

class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoder = None

    def fit(self, X, y=None):
        # Replace NaN with a placeholder so the encoder accepts the data
        self.encoder = OrdinalEncoder()
        self.encoder.fit(X.fillna('Missing'))
        return self  # fit must return self for pipeline compatibility

    def transform(self, X, y=None):
        return self.encoder.transform(X.fillna('Missing'))

    # fit_transform is supplied by TransformerMixin (fit followed by transform)

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=5))
])

In this scenario, however, I cannot find a reasonable way to set the encoded 'Missing' values back to np.nan before imputing with the KNNImputer.

I've read on this thread that I could do this manually using the OneHotEncoder transformer: Cyclical Loop Between OneHotEncoder and KNNImpute in Scikit-learn. But again, I'd like to implement all of this in a pipeline to automate the entire pre-processing phase.

Has anyone managed to do this? Would anyone recommend an alternative solution? Is imputing with a KNN algorithm maybe not worth the trouble and should I use a simple imputer instead?

Thanks in advance for your feedback!

asked Nov 18 '20 by LazyEval

2 Answers

I am afraid this cannot work. If you one-hot encode your categorical data, your missing values will be encoded into a new binary variable, and KNNImputer will fail to deal with them because:

  • it works on one column at a time, not on the full set of one-hot encoded columns
  • there won't be any missing values left to deal with anymore

Anyway, you have a few options for imputing missing categorical variables using scikit-learn:

  1. you can use sklearn.impute.SimpleImputer with strategy="most_frequent": this will replace missing values using the most frequent value along each column, whether the data are strings or numeric (see the sketch after this list)
  2. use sklearn.impute.KNNImputer with some limitations: you first have to transform your categorical features into numeric ones while preserving the NaN values (see: LabelEncoder that keeps missing values as 'NaN'), then you can use the KNNImputer with only the nearest neighbour as replacement (if you use more than one neighbour, the result will be a meaningless average of the encoded labels). For example:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.impute import KNNImputer

    df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})

    # Label-encode each column on its non-null values only,
    # so the NaN entries survive the encoding
    df = df.apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))

    imputer = KNNImputer(n_neighbors=1)
    imputer.fit_transform(df)
    
    Input df (before encoding):
        A   B   C
    0   x   1   2.0
    1   NaN 6   1.0
    2   z   9   NaN

    Output (encoded and imputed):
    array([[0., 0., 1.],
           [0., 1., 0.],
           [1., 2., 0.]])
  3. use sklearn.impute.IterativeImputer and replicate a MissForest imputer for mixed data (but you will have to process numeric and categorical features separately). For example:
    import numpy as np
    import pandas as pd
    from sklearn.preprocessing import LabelEncoder
    from sklearn.experimental import enable_iterative_imputer
    from sklearn.impute import IterativeImputer
    from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
    
    df = pd.DataFrame({'A': ['x', np.nan, 'z'], 'B': [1, 6, 9], 'C': [2, 1, np.nan]})
    
    categorical = ['A']
    numerical = ['B', 'C']
    
    # Label-encode the categorical column(s) while preserving NaN
    df[categorical] = df[categorical].apply(lambda series: pd.Series(
        LabelEncoder().fit_transform(series[series.notnull()]),
        index=series[series.notnull()].index
    ))
    
    print(df)
    
    # Numeric features: iterative imputation with a regressor
    imp_num = IterativeImputer(estimator=RandomForestRegressor(),
                               initial_strategy='mean',
                               max_iter=10, random_state=0)
    # Categorical features: iterative imputation with a classifier
    imp_cat = IterativeImputer(estimator=RandomForestClassifier(),
                               initial_strategy='most_frequent',
                               max_iter=10, random_state=0)

    df[numerical] = imp_num.fit_transform(df[numerical])
    df[categorical] = imp_cat.fit_transform(df[categorical])
    
    print(df)
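
For completeness, a minimal sketch of option 1 above (the toy frame mirrors the earlier examples; SimpleImputer's most_frequent strategy handles string and numeric columns alike):

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'A': ['x', np.nan, 'z', 'x'], 'B': [1, 6, np.nan, 6]})

# most_frequent replaces NaN with each column's mode,
# whether the column holds strings or numbers
imputer = SimpleImputer(strategy='most_frequent')
print(imputer.fit_transform(df))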
answered Sep 20 '22 by Luca Massaron

For anyone interested, I managed to implement a custom label encoder that ignores np.nan and is compatible with the sklearn pipeline transformers, similar to the LEncoder Luca Massaron implemented in his GitHub repo: https://github.com/lmassaron/deep_learning_for_tabular_data

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import LabelEncoder

class CustomEncoder(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.encoders = dict()

    def fit(self, X, y=None):
        for col in X.columns:
            # Fit each column's encoder on its non-null values only
            le = LabelEncoder()
            le.fit(X.loc[X[col].notna(), col])
            le_dict = dict(zip(le.classes_, le.transform(le.classes_)))

            # Set unknown to new value so transform on test set handles unknown values
            max_value = max(le_dict.values())
            le_dict['_unk'] = max_value + 1

            self.encoders[col] = le_dict
        return self

    def transform(self, X, y=None):
        X = X.copy()  # avoid mutating the caller's frame
        for col in X.columns:
            le_dict = self.encoders[col]
            # Encode only the non-null entries, leaving np.nan in place
            X.loc[X[col].notna(), col] = X.loc[X[col].notna(), col].apply(
                lambda x: le_dict.get(x, le_dict['_unk'])).values
        return X

    def fit_transform(self, X, y=None, **fit_params):
        self.fit(X, y)
        return self.transform(X, y)
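
A hypothetical usage sketch wiring this encoder into the question's original pipeline (cat_features and num_features stand in for your own column lists; n_neighbors=1 follows the advice in the accepted answer, since the categoricals are label-encoded):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.impute import KNNImputer

preprocessor = ColumnTransformer([
    ('categoricals', CustomEncoder(), cat_features),
    ('numericals', StandardScaler(), num_features)],
    remainder='passthrough'
)

pipeline = Pipeline([
    ('preprocessing', preprocessor),
    ('imputing', KNNImputer(n_neighbors=1))
])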
answered Sep 20 '22 by LazyEval