I'm trying to create an sklearn.compose.ColumnTransformer
pipeline for transforming both categorical and continuous input data:
import pandas as pd
from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer
df = pd.DataFrame(
{
'a': [1, 'a', 1, np.nan, 'b'],
'b': [1, 2, 3, 4, 5],
'c': list('abcde'),
'd': list('aaabb'),
'e': [0, 1, 1, 0, 1],
}
)
for col in df.select_dtypes('object'):
df[col] = df[col].astype(str)
categorical_columns = list('acd')
continuous_columns = list('be')
categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'
column_transformer = ColumnTransformer(
[
('categorical', categorical_transformer, categorical_columns),
('continuous', continuous_transformer, continuous_columns),
]
,
sparse_threshold=0.,
n_jobs=-1
)
X = column_transformer.fit_transform(df)
I want to access the feature names created by this transformation pipeline, so I try this:
column_transformer.get_feature_names()
Which raises:
NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.
Since I'm not technically doing anything with columns b
and e
, I technically could just append them onto X
after one-hot encoding all other features, but is there some way I can use one of the scikit base classes (e.g. TransformerMixin
, BaseEstimator
, or FunctionTransformer
) to add to this pipeline so I can grab the continuous feature names in a very pipeline-friendly way?
Something like this, perhaps:
class PassthroughTransformer(FunctionTransformer, BaseEstimator):
def fit(self):
return self
def transform(self, X)
self.X = X
return X
def get_feature_names(self):
return self.X.values.tolist()
continuous_transformer = PassthroughTransformer()
column_transformer = ColumnTransformer(
[
('categorical', categorical_transformer, categorical_columns),
('continuous', continuous_transformer, continuous_columns),
]
,
sparse_threshold=0.,
n_jobs=-1
)
X = column_transformer.fit_transform(df)
But this raises this exception:
TypeError: Cannot clone object '<__main__.PassthroughTransformer object at 0x1132ddf60>' (type <class '__main__.PassthroughTransformer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.
There are multiple issues here:
Cannot clone object error is due to parallel processing:
By default, scikit-learn
clones (which uses pickle
) the supplied transformers and estimators when its working in Pipeline and similar (FeatureUnion, ColumnTransformer etc) or in cross-validation (cross_val_score
, GridSearchCV
etc).
Now, you have specified n_jobs=-1
in your ColumnTransformer
, which introduces multiprocessing in the code. Python's inbuilt pickling dont work well with multiprocessing. And hence the error.
Options:
Set n_jobs = 1
to not use multiprocessing. Still need to correct the code according to point 2 and 3.
If you want to use multiprocessing, then simplest solution is to define the custom classes in a separate file (module) and import it into your main file. Something like this:
Make a new file in same folder named custom_transformers.py with contents:
from sklearn.base import TransformerMixin, BaseEstimator
# Changed the base classes here, see Point 3
class PassthroughTransformer(BaseEstimator, TransformerMixin):
# I corrected the `fit()` method here, it should take X, y as input
def fit(self, X, y=None):
return self
def transform(self, X):
self.X = X
return X
# I have corrected the output here, See point 2
def get_feature_names(self):
return self.X.columns.tolist()
Now in your main file, do this:
from custom_transformers import PassthroughTransformer
For more information, see these questions:
You return self.X.values.tolist()
:-
Here X
is a Pandas DataFrame
, so X.values.tolist()
will return the actual data of the columns you specify, not the column names. So even if you solve the first error, you will get error in this. Correct this to:
return self.X.columns.tolist()
(Minor) Class inheriting:
You defined the PassthroughTransformer as:
PassthroughTransformer(FunctionTransformer, BaseEstimator)
FunctionTransformer
already inherits from BaseEstimator
so I dont think there is need to inherit from BaseEstimator
. You can change it in following ways:
class PassthroughTransformer(FunctionTransformer):
OR
# Standard way
class PassthroughTransformer(BaseEstimator, TransformerMixin):
Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With