I have defined data for fitting with one categorical feature "sex":
data = pd.DataFrame({
    'age': [25, 19, 17],
    'sex': ['female', 'male', 'female'],
    'won_lottery': [False, True, False]
})
X = data[['age', 'sex']]
y = data['won_lottery']
and a pipeline for transforming categorical features:
ohe = OneHotEncoder(handle_unknown='ignore')
cat_transformers = Pipeline([
    ('onehot', ohe)
])
When fitting cat_transformers directly with the data:
cat_transformers.fit(X[['sex']], y)
print(ohe.get_feature_names())
I am able to get the names of the output features created by the OneHotEncoder instance:
['x0_female' 'x0_male']
However, if I encapsulate cat_transformers into a ColumnTransformer:
preprocessor = ColumnTransformer(
    transformers=[
        ('cat', cat_transformers, ['sex'])
    ]
)
preprocessor.fit(X, y)
print(ohe.get_feature_names())
it fails with
sklearn.exceptions.NotFittedError: This OneHotEncoder instance is not fitted yet.
Call 'fit' with appropriate arguments before using this method.
I would expect that calling fit() on the ColumnTransformer would in turn call fit() on all of its transformers. Why does it not work this way?
Ok, I understand it now. I was fitting one instance of OneHotEncoder and checking feature names on another instance:
print(id(ohe))
print(id(preprocessor.named_transformers_['cat'].named_steps['onehot']))
2757198591872
2755226729104
It looks like ColumnTransformer clones its transformers before fitting, so the original ohe object is never fitted.