Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to handle missing values (NaN) in categorical data when using scikit-learn OneHotEncoder?

I have recently started learning python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprised of both numerical and categorical data. The dataset has lots of missing values. I am currently trying to encode the categorical features using OneHotEncoder. When I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all the feature's categories, as such:

0     Male 
1     Female
2     NaN

After applying OneHotEncoder:

0     10 
1     01
2     00

However, when running the following code:

    # Encoding categorical data
    from sklearn.compose import ColumnTransformer
    from sklearn.preprocessing import OneHotEncoder


    ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
                           remainder='passthrough')
    obj_df = np.array(ct.fit_transform(obj_df))
    print(obj_df)

I am getting the error ValueError: Input contains NaN

So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong. Is there a way for me to get the functionality described above? I know imputing the missing values before encoding will resolve this issue, but I am reluctant to do this as I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.

I found this question that is similar but the answer doesn't offer a detailed enough solution on how to deal with the NaN values.

Let me know what your thoughts are, thanks.

like image 717
sums22 Avatar asked Feb 04 '23 15:02

sums22


1 Answers

You will need to impute the missing values before. You can define a Pipeline with an imputing step using SimpleImputer setting a constant strategy to input a new category for null fields, prior to the OneHot encoding:

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np

categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('encoder', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', categorical_transformer, [0])
    ])

df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)
array([[0., 1., 0.],
       [1., 0., 0.],
       [0., 0., 1.]])
like image 156
yatu Avatar answered Feb 06 '23 14:02

yatu