 

How to use SimpleImputer class to impute missing values in different columns with different constant values?

I was using sklearn.impute.SimpleImputer(strategy='constant', fill_value=0) to impute all columns that have missing values with a constant value (0 being that constant value here).

But it sometimes makes sense to impute different constant values in different columns. For example, I might like to replace all NaN values of a certain column with the maximum value from that column, replace another column's NaN values with the minimum, or use the median/mean of that particular column's values.

How can I achieve this?

Also, I'm actually new to this field, so I'm not really sure whether doing this will improve my model's results. Your opinions are welcome.

lenikhilsingh asked Jul 16 '19 14:07


People also ask

Does SimpleImputer work with categorical variables?

SimpleImputer is designed to work with numerical data, but it can also handle categorical data represented as strings (with the "most_frequent" or "constant" strategies). SimpleImputer can be used as part of a scikit-learn Pipeline. The default strategy is "mean", which replaces missing values with the mean value of the column.
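As a minimal sketch (toy data, with a made-up column name for illustration), the "most_frequent" strategy handles string-typed categoricals:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# toy categorical column; 'fuel' is a hypothetical name for illustration
X = pd.DataFrame({'fuel': ['gas', 'gas', np.nan, 'diesel']})

# 'most_frequent' works on strings; 'mean'/'median' would raise on this data
imputer = SimpleImputer(strategy='most_frequent')
X_filled = imputer.fit_transform(X)  # NaN becomes 'gas', the modal value
```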

How do you impute missing values for categorical variables?

One approach to imputing categorical features is to replace missing values with the most common class. You can do this by taking the index of the most common value given by Pandas' value_counts function.
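For example, with pandas alone (toy data, hypothetical column name):

```python
import pandas as pd

# toy frame with one missing categorical value
df = pd.DataFrame({'color': ['red', 'blue', 'red', None]})

# value_counts sorts by frequency, so index[0] is the most common class
most_common = df['color'].value_counts().index[0]
df['color'] = df['color'].fillna(most_common)
```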


1 Answer

If you want to impute different features with different arbitrary values, or with the median, you need to set up several SimpleImputer steps within a pipeline, and then join them with the ColumnTransformer:

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# first we need to make lists, indicating which features
# will be imputed with each method

features_numeric = ['LotFrontage', 'MasVnrArea', 'GarageYrBlt']
features_categoric = ['BsmtQual', 'FireplaceQu']

# then we instantiate the imputers, within a pipeline
# we create one imputer for numerical and one imputer
# for categorical

# this imputer imputes with the mean
imputer_numeric = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='mean')),
])

# this imputer imputes with an arbitrary value
imputer_categoric = Pipeline(
    steps=[('imputer',
            SimpleImputer(strategy='constant', fill_value='Missing'))])

# then we put the features list and the transformers together
# using the column transformer

preprocessor = ColumnTransformer(transformers=[('imputer_numeric',
                                                imputer_numeric,
                                                features_numeric),
                                               ('imputer_categoric',
                                                imputer_categoric,
                                                features_categoric)])

# now we fit the preprocessor
preprocessor.fit(X_train)

# and now we can impute the data
# remember it returns a numpy array

X_train = preprocessor.transform(X_train)
X_test = preprocessor.transform(X_test)
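To address the original question directly (different constant values in different columns), one option is a ColumnTransformer with one constant SimpleImputer per column. This is a minimal sketch with made-up column names; note that deriving the fill value (e.g. the column maximum) from the full dataset would leak information into a test set, so in practice compute it from the training data only:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# toy data; column names 'a' and 'b' are hypothetical
X = pd.DataFrame({'a': [1.0, np.nan, 3.0],
                  'b': [np.nan, 5.0, 6.0]})

# one constant imputer per column, each with its own fill_value:
# fill 'a' with its observed maximum and 'b' with 0
ct = ColumnTransformer(transformers=[
    ('impute_a', SimpleImputer(strategy='constant', fill_value=X['a'].max()), ['a']),
    ('impute_b', SimpleImputer(strategy='constant', fill_value=0), ['b']),
])

X_imputed = ct.fit_transform(X)  # numpy array, columns in transformer order
```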

Alternatively, you can use the package Feature-engine, whose transformers let you specify the features to impute:

from feature_engine import imputation as msi
from sklearn.pipeline import Pipeline

pipe = Pipeline([
    # add a binary variable to indicate missing information for the 2 variables below
    ('continuous_var_imputer', msi.AddMissingIndicator(variables = ['LotFrontage', 'GarageYrBlt'])),
     
    # replace NA by the median in the 3 variables below, they are numerical
    ('continuous_var_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables = ['LotFrontage', 'GarageYrBlt', 'MasVnrArea'])),
     
    # replace NA by adding the label "Missing" in categorical variables (transformer will skip those variables where there is no NA)
    ('categorical_imputer', msi.CategoricalImputer(variables = ['var1', 'var2'])),
     
    # impute the remaining numerical variables with the median;
    # to handle those, I add an additional step here
    ('additional_median_imputer', msi.MeanMedianImputer(imputation_method='median', variables = ['var4', 'var5'])),
     ])

pipe.fit(X_train)
X_train_t = pipe.transform(X_train)

Feature-engine returns dataframes; see the Feature-engine documentation for more info.

To install Feature-Engine do:

pip install feature-engine

Hope that helps

Sole Galli answered Oct 24 '22 12:10