I want to get feature names after I fit the pipeline.
categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])
Then
clf = Pipeline(steps=[('preprocessor', preprocessor), ('regressor', GradientBoostingRegressor())])
After fitting with a pandas DataFrame, I can get feature importances from
clf.steps[1][1].feature_importances_
and I tried clf.steps[0][1].get_feature_names()
but I got an error
AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.
How can I get feature names from this?
Applies transformers to columns of an array or pandas DataFrame. This estimator allows different columns or column subsets of the input to be transformed separately and the features generated by each transformer will be concatenated to form a single feature space.
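As a minimal sketch of that concatenation (the toy frame and column names below are made up purely for illustration):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy frame: one numeric column, one categorical column.
toy = pd.DataFrame({'num1': [1.0, 2.0, 3.0], 'brand': ['a', 'b', 'a']})

ct = ColumnTransformer(transformers=[
    ('num', StandardScaler(), ['num1']),
    ('cat', OneHotEncoder(), ['brand'])])

# One scaled numeric column plus two one-hot columns are concatenated
# into a single feature matrix.
print(ct.fit_transform(toy).shape)  # (3, 3)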
Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.
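A small illustration of that automatic naming (not part of the question's code):

from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(SimpleImputer(strategy='median'), StandardScaler())

# Step names are generated from the lowercased class names.
print(list(pipe.named_steps))  # ['simpleimputer', 'standardscaler']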
The purpose of the pipeline is to assemble several steps that can be cross-validated together while setting different parameters. For this, it enables setting parameters of the various steps using their names and the parameter name separated by a '__', as in the example below.
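For example, with the step names from the question's clf pipeline (this assumes the clf defined above; the concrete parameter values are only illustrative):

from sklearn.model_selection import GridSearchCV

# Reach into nested steps with <step>__<substep>__<param> paths.
clf.set_params(preprocessor__num__imputer__strategy='mean',
               regressor__n_estimators=200)

# The same naming scheme is used in grid-search parameter grids.
param_grid = {'regressor__learning_rate': [0.05, 0.1]}
search = GridSearchCV(clf, param_grid, cv=3)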
You can access the feature names using the following snippet:
clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)
With sklearn >= 0.21, this can be made simpler:
clf['preprocessor'].transformers_[1][1]\
   ['onehot'].get_feature_names(categorical_features)
Reproducible example:
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})

numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', LinearRegression())])

clf.fit(df.drop('target', axis=1), df['target'])

clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
#  'category_asdf' 'category_asdfas' 'category_asfa']
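If you also need one flat list that lines up with the regressor's coefficients (or with feature_importances_ in the question's setup), you can prepend the numeric column names to the one-hot names; a small sketch building on the example above:

onehot_names = clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

# 'num' comes before 'cat' in the ColumnTransformer, so the transformed
# matrix starts with the numeric columns followed by the one-hot columns.
all_feature_names = numeric_features + list(onehot_names)

pd.Series(clf.named_steps['regressor'].coef_, index=all_feature_names)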
Scikit-Learn 1.0 now keeps track of feature names via the new get_feature_names_out method.
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
SimpleImputer.get_feature_names_out = (
    lambda self, names=None: self.feature_names_in_)

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
    (num_pipeline, ["age", "height"]),
    (OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())

df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
                   "age": [32, 65, 18, 24],
                   "height": [172, 163, 169, 190],
                   "weight": [65, 62, 54, 95]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])

pipeline.fit(df, df["weight"])

## get pipeline feature names
pipeline[:-1].get_feature_names_out()

## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
             columns=pipeline[:-1].get_feature_names_out(),
             index=df.index)
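Because these names are in the same order as the transformed columns, they also line up with the final estimator's learned parameters; for instance, using the pipeline fitted above:

## label the linear model's coefficients with the pipeline feature names
pd.Series(pipeline[-1].coef_,
          index=pipeline[:-1].get_feature_names_out())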