Adding get_feature_names to ColumnTransformer pipeline

I'm trying to create an sklearn.compose.ColumnTransformer pipeline for transforming both categorical and continuous input data:

import numpy as np
import pandas as pd

from sklearn.base import TransformerMixin, BaseEstimator
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer, make_column_transformer
from sklearn.impute import SimpleImputer

df = pd.DataFrame(
    {
            'a': [1, 'a', 1, np.nan, 'b'],
            'b': [1, 2, 3, 4, 5],
            'c': list('abcde'),
            'd': list('aaabb'),
            'e': [0, 1, 1, 0, 1],
    }
)

for col in df.select_dtypes('object'):
    df[col] = df[col].astype(str)

categorical_columns = list('acd')
continuous_columns = list('be')

categorical_transformer = OneHotEncoder(sparse=False, handle_unknown='ignore')
continuous_transformer = 'passthrough'

column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ]
    ,
    sparse_threshold=0.,
    n_jobs=-1
)

X = column_transformer.fit_transform(df)

I want to access the feature names created by this transformation pipeline, so I try this:

column_transformer.get_feature_names()

Which raises:

NotImplementedError: get_feature_names is not yet supported when using a 'passthrough' transformer.

Since I'm not actually doing anything with columns b and e, I could technically just append them onto X after one-hot encoding all the other features, but is there some way I can use one of the scikit-learn base classes (e.g. TransformerMixin, BaseEstimator, or FunctionTransformer) to add to this pipeline so I can grab the continuous feature names in a very pipeline-friendly way?

Something like this, perhaps:

class PassthroughTransformer(FunctionTransformer, BaseEstimator):
    def fit(self):
        return self
    def transform(self, X):
        self.X = X
        return X
    def get_feature_names(self):
        return self.X.values.tolist()

continuous_transformer = PassthroughTransformer()

column_transformer = ColumnTransformer(
    [
        ('categorical', categorical_transformer, categorical_columns),
        ('continuous', continuous_transformer, continuous_columns),
    ]
    ,
    sparse_threshold=0.,
    n_jobs=-1
)

X = column_transformer.fit_transform(df)

But this raises this exception:

TypeError: Cannot clone object '<__main__.PassthroughTransformer object at 0x1132ddf60>' (type <class '__main__.PassthroughTransformer'>): it does not seem to be a scikit-learn estimator as it does not implement a 'get_params' methods.

1 Answer

There are multiple issues here (a combined sketch with all the fixes applied follows the list):

  1. The Cannot clone object error is due to parallel processing:

    By default, scikit-learn clones the supplied transformers and estimators when working in a Pipeline or similar composites (FeatureUnion, ColumnTransformer, etc.) or in cross-validation (cross_val_score, GridSearchCV, etc.), and they get pickled when the work is dispatched to parallel workers.

    Now, you have specified n_jobs=-1 in your ColumnTransformer, which introduces multiprocessing into the code. Python's built-in pickling doesn't work well with multiprocessing, particularly for classes defined interactively, hence the error.

    Options:

    1. Set n_jobs=1 to avoid multiprocessing. You will still need to correct the code according to points 2 and 3.

    2. If you want to use multiprocessing, the simplest solution is to define the custom classes in a separate file (module) and import them into your main file. Something like this:

    Make a new file in the same folder named custom_transformers.py with these contents:

    from sklearn.base import TransformerMixin, BaseEstimator
    
    # Changed the base classes here, see Point 3
    class PassthroughTransformer(BaseEstimator, TransformerMixin):
    
        # I corrected the `fit()` method here, it should take X, y as input
        def fit(self, X, y=None):
            return self
    
        def transform(self, X):
            self.X = X
            return X
    
        # I have corrected the output here, See point 2
        def get_feature_names(self):
            return self.X.columns.tolist()
    

    Now in your main file, do this:

    from custom_transformers import PassthroughTransformer
    

    For more information, see these questions:

    • Python multiprocessing pickling error
    • Python: Can't pickle type X, attribute lookup failed
    • Custom sklearn pipeline transformer giving "pickle.PicklingError" (I am suggesting this workaround)
  2. You return self.X.values.tolist():

    Here X is a pandas DataFrame, so X.values.tolist() will return the actual data of the columns you specify, not the column names. So even if you solve the first error, you will get an error here. Correct this to:

    return self.X.columns.tolist()
    
  3. (Minor) Class inheritance:

    You defined the PassthroughTransformer as:

    PassthroughTransformer(FunctionTransformer, BaseEstimator)
    

    FunctionTransformer already inherits from BaseEstimator, so there is no need to inherit from BaseEstimator again. You can change it in either of the following ways:

    class PassthroughTransformer(FunctionTransformer):

    # OR, the standard way:
    class PassthroughTransformer(BaseEstimator, TransformerMixin):
    

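Putting the pieces together, a minimal sketch could look like the following. It assumes an older scikit-learn release where OneHotEncoder(sparse=...) and ColumnTransformer.get_feature_names() are still available, and that PassthroughTransformer is imported from the custom_transformers.py module shown above:

    # Minimal sketch, assuming an older scikit-learn release where
    # OneHotEncoder(sparse=...) and ColumnTransformer.get_feature_names()
    # still exist, and that PassthroughTransformer lives in
    # custom_transformers.py as shown in point 1.
    import numpy as np
    import pandas as pd

    from sklearn.preprocessing import OneHotEncoder
    from sklearn.compose import ColumnTransformer

    from custom_transformers import PassthroughTransformer

    df = pd.DataFrame(
        {
            'a': [1, 'a', 1, np.nan, 'b'],
            'b': [1, 2, 3, 4, 5],
            'c': list('abcde'),
            'd': list('aaabb'),
            'e': [0, 1, 1, 0, 1],
        }
    )
    for col in df.select_dtypes('object'):
        df[col] = df[col].astype(str)

    categorical_columns = list('acd')
    continuous_columns = list('be')

    column_transformer = ColumnTransformer(
        [
            ('categorical', OneHotEncoder(sparse=False, handle_unknown='ignore'), categorical_columns),
            ('continuous', PassthroughTransformer(), continuous_columns),
        ],
        sparse_threshold=0.,
        n_jobs=-1,  # works because the class is importable from a module;
                    # use n_jobs=1 if you want to avoid multiprocessing entirely
    )

    X = column_transformer.fit_transform(df)

    # The continuous names come back prefixed with the transformer name,
    # i.e. 'continuous__b' and 'continuous__e'; the exact form of the
    # one-hot names depends on the scikit-learn version.
    print(column_transformer.get_feature_names())

If you go with option 1 (n_jobs=1) instead, the separate module is not strictly required and the class can live in your main script, as long as the fixes from points 2 and 3 are applied.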
Hope this helps.
