
Why does ColumnTransformer not call fit on its transformers?

I have defined data for fitting with one categorical feature "sex":

import pandas as pd

data = pd.DataFrame({
    'age': [25, 19, 17],
    'sex': ['female', 'male', 'female'],
    'won_lottery': [False, True, False]
})
X = data[['age', 'sex']]
y = data['won_lottery']

and a pipeline for transforming categorical features:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
cat_transformers = Pipeline([
    ('onehot', ohe)
])

When I fit cat_transformers on the data directly:

cat_transformers.fit(X[['sex']], y)
print(ohe.get_feature_names())

I can get the names of the output features created by the OneHotEncoder instance:

['x0_female' 'x0_male']    

However, if I wrap cat_transformers in a ColumnTransformer:

from sklearn.compose import ColumnTransformer

preprocessor = ColumnTransformer(
    transformers=[
        ('cat', cat_transformers, ['sex'])
    ]
)
preprocessor.fit(X, y)
print(ohe.get_feature_names())

it fails with:

sklearn.exceptions.NotFittedError: This OneHotEncoder instance is not fitted yet. 
  Call 'fit' with appropriate arguments before using this method.

I would expect calling fit() on a ColumnTransformer to call fit() on all of its transformers.

Why does it not work this way?

dzieciou asked Jun 12 '19 06:06

1 Answer

Ok, I understand it now. I was fitting one instance of OneHotEncoder and checking features on another instance:

print(id(ohe))
print(id(preprocessor.named_transformers_['cat'].named_steps['onehot']))

2757198591872
2755226729104

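It looks like ColumnTransformer clones its transformers before fitting, so the original ohe object is never fitted. The fitted copies are available through named_transformers_.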
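To make the consequence concrete, here is a minimal sketch that reuses the data and names from the question and reads the learned categories from the fitted copy instead of from the original ohe. It inspects the categories_ attribute rather than the feature-name method, because the name of that method differs between scikit-learn versions; that substitution is my choice for the illustration, not part of the original question.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

data = pd.DataFrame({
    'age': [25, 19, 17],
    'sex': ['female', 'male', 'female'],
    'won_lottery': [False, True, False]
})
X = data[['age', 'sex']]
y = data['won_lottery']

ohe = OneHotEncoder(handle_unknown='ignore')
cat_transformers = Pipeline([('onehot', ohe)])
preprocessor = ColumnTransformer(
    transformers=[('cat', cat_transformers, ['sex'])]
)
preprocessor.fit(X, y)

# The original encoder was never fitted, so it has no learned categories_.
print(hasattr(ohe, 'categories_'))

# The fitted copy lives inside the fitted ColumnTransformer.
fitted_ohe = preprocessor.named_transformers_['cat'].named_steps['onehot']
print(fitted_ohe is ohe)

# Categories learned from the 'sex' column by the fitted copy.
print(fitted_ohe.categories_)
```

So after fitting, anything that needs the learned state of the encoder should go through preprocessor.named_transformers_ rather than through the variables used to build the ColumnTransformer.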

dzieciou answered Nov 19 '22 01:11