Can You Consistently Keep Track of Column Labels Using Sklearn's Transformer API?

Tags: python, pandas, scikit-learn

This seems like a very important issue for this library, and so far I don't see a decisive answer, although it seems like for the most part, the answer is 'No.'

Right now, any method that uses the transformer API in sklearn returns a NumPy array as its result. Usually this is fine, but if you're chaining together a multi-step process that expands or reduces the number of columns, not having a clean way to track how the new columns relate to the original column labels makes it difficult to use this section of the library to its fullest.
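To make the problem concrete, here is a minimal sketch showing that, with default settings, a transformer fitted on a labeled DataFrame hands back a bare NumPy array. The toy DataFrame and its column names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data; the column names are illustrative only.
df = pd.DataFrame({"age": [23.0, 12.0, 35.0], "sales": [100.0, 250.0, 80.0]})

scaled = StandardScaler().fit_transform(df)

# With default settings the result is a plain ndarray: same shape, no labels.
print(type(scaled).__name__, scaled.shape)
print(hasattr(scaled, "columns"))
```

The values are all there, but nothing in `scaled` says which column was "age" and which was "sales".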

As an example, here's a snippet that I just recently used, where the inability to map new columns to the ones originally in the dataset was a big drawback:

numeric_columns = train.select_dtypes(include=np.number).columns.tolist()
# np.object was removed in NumPy 1.24; the string 'object' selects the same columns
cat_columns     = train.select_dtypes(include='object').columns.tolist()

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns)
]

combined_pipe = ColumnTransformer(transformers)

train_clean = combined_pipe.fit_transform(train)

test_clean  = combined_pipe.transform(test)

In this example I split up my dataset using the ColumnTransformer and then added additional columns using the OneHotEncoder, so my arrangement of columns is not the same as what I started out with.

I could easily end up with a different arrangement if I used other modules that follow the same API: OrdinalEncoder, SelectKBest, etc.
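A small sketch of that situation, using made-up data: SelectKBest drops a column, and the only link back to the original labels is the boolean mask from `get_support()`. The column names and values below are assumptions chosen so that the selection is easy to follow.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical data: "a" and "c" separate the two classes, "b" is noise.
X = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 11.0, 12.0, 13.0],
    "b": [5.0, 1.0, 4.0, 2.0, 6.0, 3.0],
    "c": [0.0, 1.0, 0.0, 10.0, 9.0, 10.0],
})
y = [0, 0, 0, 1, 1, 1]

selector = SelectKBest(f_classif, k=2)
X_new = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the input columns,
# which can be applied to the original labels by hand.
kept = X.columns[selector.get_support()].tolist()
print(X_new.shape, kept)
```

The mapping has to be rebuilt manually for each step, which is exactly what gets tedious once several such steps are chained.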

If you're doing multi-step transformations, is there a way to consistently see how your new columns relate to your original dataset?

There's an extensive discussion about it here, but I don't think anything has been finalized yet.

asked Aug 16 '19 16:08

Jonathan Bechtel

1 Answer

Yes, you are right that there isn't complete support for tracking feature names in sklearn as of now. Initially, it was decided to keep the output generic, at the level of a NumPy array. The latest progress on adding feature names to sklearn estimators can be tracked here.

Anyhow, we can create wrappers to get the feature names of the ColumnTransformer. I am not sure whether this covers every possible kind of ColumnTransformer, but at least it can solve your problem.

From `Documentation of ColumnTransformer`:

Notes

The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.
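The ordering rule in that note can be checked with a short sketch. The data and the transformer name `scale_c` are made up; the point is only that the transformed column comes first and the passthrough columns are appended on the right.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data: only "c" is listed in the transformers.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0],
    "b": [10.0, 20.0, 30.0],
    "c": [5.0, 0.0, 10.0],
})

ct = ColumnTransformer([("scale_c", MinMaxScaler(), ["c"])], remainder="passthrough")
out = ct.fit_transform(df)

# Column 0 is the scaled "c"; the untouched "a" and "b" follow on the right.
print(out)
```

So the output order here is c, a, b, even though the input order was a, b, c.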

Try this!

import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline, Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder, MinMaxScaler
from sklearn.feature_extraction.text import _VectorizerMixin
from sklearn.feature_selection._base import SelectorMixin
from sklearn.feature_selection import SelectKBest
from sklearn.feature_extraction.text import CountVectorizer

train = pd.DataFrame({'age': [23,12, 12, np.nan],
                      'Gender': ['M','F', np.nan, 'F'],
                      'income': ['high','low','low','medium'],
                      'sales': [10000, 100020, 110000, 100],
                      'foo' : [1,0,0,1],
                      'text': ['I will test this',
                               'need to write more sentence',
                               'want to keep it simple',
                               'hope you got that these sentences are junk'],
                      'y': [0,1,1,1]})
numeric_columns = ['age']
cat_columns     = ['Gender','income']

numeric_pipeline = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())
cat_pipeline     = make_pipeline(SimpleImputer(strategy='most_frequent'), OneHotEncoder())
text_pipeline = make_pipeline(CountVectorizer(), SelectKBest(k=5))

transformers = [
    ('num', numeric_pipeline, numeric_columns),
    ('cat', cat_pipeline, cat_columns),
    ('text', text_pipeline, 'text'),
    ('simple_transformer', MinMaxScaler(), ['sales']),
]

combined_pipe = ColumnTransformer(
    transformers, remainder='passthrough')

# pandas 2.0 removed the positional axis argument, so name the column explicitly
transformed_data = combined_pipe.fit_transform(
    train.drop(columns='y'), train['y'])

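def get_feature_out(estimator, feature_in):
    # Prefer get_feature_names_out (scikit-learn >= 1.0); get_feature_names
    # was deprecated in 1.0 and removed in 1.2, so it is only a fallback.
    if isinstance(estimator, _VectorizerMixin):
        # handling all vectorizers
        if hasattr(estimator, 'get_feature_names_out'):
            names = estimator.get_feature_names_out()
        else:
            names = estimator.get_feature_names()
        return [f'vec_{f}' for f in names]
    elif isinstance(estimator, SelectorMixin):
        # selectors only drop columns, so mask the incoming names
        return np.array(feature_in)[estimator.get_support()]
    elif hasattr(estimator, 'get_feature_names_out'):
        return estimator.get_feature_names_out(feature_in)
    elif hasattr(estimator, 'get_feature_names'):
        return estimator.get_feature_names(feature_in)
    else:
        # no naming method available: assume the columns pass through unchanged
        return feature_in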


def get_ct_feature_names(ct):
    # handles estimators and pipelines inside a ColumnTransformer
    # remainder='passthrough' needs the input column names, which are
    # read from the private attribute ct._feature_names_in below.
    output_features = []

    for name, estimator, features in ct.transformers_:
        if name!='remainder':
            if isinstance(estimator, Pipeline):
                current_features = features
                for step in estimator:
                    current_features = get_feature_out(step, current_features)
                features_out = current_features
            else:
                features_out = get_feature_out(estimator, features)
            output_features.extend(features_out)
        elif estimator=='passthrough':
            output_features.extend(ct._feature_names_in[features])
                
    return output_features

pd.DataFrame(transformed_data, 
             columns=get_ct_feature_names(combined_pipe))


answered Oct 08 '22 09:10

Venkatachalam
