I have recently started learning python to develop a predictive model for a research project using machine learning methods. I have a large dataset comprised of both numerical and categorical data. The dataset has lots of missing values. I am currently trying to encode the categorical features using OneHotEncoder. When I read about OneHotEncoder, my understanding was that for a missing value (NaN), OneHotEncoder would assign 0s to all the feature's categories, as such:
0 Male
1 Female
2 NaN
After applying OneHotEncoder:
0 10
1 01
2 00
However, when running the following code:
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer([('encoder', OneHotEncoder(handle_unknown='ignore'), [1])],
remainder='passthrough')
obj_df = np.array(ct.fit_transform(obj_df))
print(obj_df)
I am getting the error ValueError: Input contains NaN
So I am guessing my previous understanding of how OneHotEncoder handles missing values is wrong. Is there a way for me to get the functionality described above? I know imputing the missing values before encoding will resolve this issue, but I am reluctant to do this as I am dealing with medical data and fear that imputation may decrease the predictive accuracy of my model.
I found this question that is similar but the answer doesn't offer a detailed enough solution on how to deal with the NaN values.
Let me know what your thoughts are, thanks.
You will need to impute the missing values before. You can define a Pipeline
with an imputing step using SimpleImputer
setting a constant
strategy to input a new category for null fields, prior to the OneHot encoding:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
import numpy as np
categorical_transformer = Pipeline(steps=[
('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
('encoder', OneHotEncoder(handle_unknown='ignore'))])
preprocessor = ColumnTransformer(
transformers=[
('cat', categorical_transformer, [0])
])
df = pd.DataFrame(['Male', 'Female', np.nan])
preprocessor.fit_transform(df)
array([[0., 1., 0.],
[1., 0., 0.],
[0., 0., 1.]])
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With