How to get feature names selected by feature elimination in sklearn pipeline?

I am using recursive feature elimination (RFE) in my sklearn pipeline. The pipeline looks something like this:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
       ('custom_features', CustomFeatures())])),  # CustomFeatures: my own transformer, sketched below
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
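(For reference, CustomFeatures is my own transformer. A minimal, purely illustrative stand-in that makes the snippet above self-contained could look like this; the doc_length feature is hypothetical, not my real implementation:)

from sklearn.base import BaseEstimator, TransformerMixin
import numpy as np

class CustomFeatures(BaseEstimator, TransformerMixin):
    """Illustrative stand-in: emits one numeric feature per document."""
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        # one column: the character length of each document
        return np.array([[len(doc)] for doc in X])

    def get_feature_names(self):
        return ['doc_length']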

How can I get the names of the features selected by the RFE? RFE should select the best 500 features, but I really need to take a look at which features have been selected.

EDIT:

I have a complex Pipeline which consists of multiple pipelines and feature unions, percentile feature selection, and, at the end, recursive feature elimination:

fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf=True, min_df=2, max_df=0.85, lowercase=True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features=1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf=True, lowercase=False)

pipeline = Pipeline([
        ('union', FeatureUnion(
                transformer_list=[

                    ('vectorized_pipeline', Pipeline([
                        ('union_vectorizer', FeatureUnion([

                            ('stem_text', Pipeline([
                                ('selector', ItemSelector(key='stem_text')),
                                ('stem_tfidf', countVecWord)
                            ])),

                            ('pos_text', Pipeline([
                                ('selector', ItemSelector(key='pos_text')),
                                ('pos_tfidf', countVecWord_tags)
                            ])),

                        ])),
                        ('percentile_feature_selection', fs_vect)
                    ])),

                    ('custom_pipeline', Pipeline([
                        ('custom_features', FeatureUnion([

                            ('pos_cluster', Pipeline([
                                ('selector', ItemSelector(key='pos_text')),
                                ('pos_cluster_inner', pos_cluster)
                            ])),

                            ('stylistic_features', Pipeline([
                                ('selector', ItemSelector(key='raw_text')),
                                ('stylistic_features_inner', stylistic_features)
                            ])),

                        ])),
                        ('percentile_feature_selection', fs),
                        ('inner_scale', inner_scaler)
                    ])),

                ],

                # weight components in FeatureUnion
                # n_jobs=6,
                transformer_weights={
                    'vectorized_pipeline': 0.8,
                    'custom_pipeline': 1.0
                },
        )),

        ('rfe_feature_selection', f5),
        ('clf', classifier),
        ])
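(ItemSelector above is the usual key-picking transformer from the sklearn "FeatureUnion with heterogeneous data sources" example; a minimal sketch, for reference:)

from sklearn.base import BaseEstimator, TransformerMixin

class ItemSelector(BaseEstimator, TransformerMixin):
    """Select a single field from a dict-like X, e.g. X['stem_text']."""
    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]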

I'll try to explain the steps. The first branch, "vectorized_pipeline", consists of vectorizers, all of which have a get_feature_names method. The second branch, "custom_pipeline", consists of my own features; I have implemented them with fit, transform, and get_feature_names methods as well. When I use the suggestion of @Kevin, I get an error that 'union' (the name of the top-level step in my pipeline) does not have a get_feature_names method:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print(np.array(feature_names)[support])

Also, when I try to get feature names from individual FeatureUnions, like this:

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
print(np.array(feature_names)[support])

I get a KeyError:

feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
asked Apr 14 '16 by Ivan Bilan



1 Answer

You can access each step of the Pipeline through the attribute named_steps. Here's an example on the iris dataset that selects only 2 features, but the solution scales.

from sklearn import datasets
from sklearn import feature_selection
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data
y = iris.target

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)

pipeline = Pipeline([
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1)
    ])

pipeline.fit(X, y)

With named_steps you can access the attributes and methods of the transformer objects in the pipeline. The RFE attribute support_ (or the method get_support()) returns a boolean mask of the selected features:

support = pipeline.named_steps['rfe_feature_selection'].support_

Now support is a boolean array; you can use it to efficiently extract the names of your selected features (columns). Make sure your feature names are in a numpy array, not a Python list.

import numpy as np
feature_names = np.array(iris.feature_names)  # convert the list to an array

feature_names[support]

array(['sepal width (cm)', 'petal width (cm)'], 
      dtype='|S17')
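If you also want to see where each feature fell in the elimination order, the fitted RFE exposes a ranking_ attribute: selected features get rank 1, and larger ranks mean the feature was eliminated earlier. A quick sketch:

ranking = pipeline.named_steps['rfe_feature_selection'].ranking_
for name, rank in sorted(zip(feature_names, ranking), key=lambda t: t[1]):
    print(name, rank)  # rank 1 = kept; larger ranks were eliminated earlier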

EDIT

Per my comment above, here is your example with the CustomFeatures() transformer removed:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C=0.1)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)

pipeline = Pipeline([
    ('features', FeatureUnion([
       ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000))])),
    ('rfe_feature_selection', f5),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
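EDIT 2

For the nested pipeline in your edit: named_steps only exposes the top-level steps ('union', 'rfe_feature_selection', 'clf'), which is why 'union_vectorizer' raises a KeyError. Calling get_feature_names on 'union' fails too, because FeatureUnion delegates to its members, and those members are Pipeline objects, which don't implement get_feature_names. You can still recover the names by walking the nested structure by hand. Here is a sketch, assuming your custom transformers implement get_feature_names as you describe (I haven't run it against your exact objects); the key point is that each percentile selector's mask must be applied before concatenating, in the same order as the FeatureUnion:

import numpy as np

union = pipeline.named_steps['union']   # the top-level FeatureUnion
fitted = dict(union.transformer_list)   # name -> fitted sub-pipeline

def union_feature_names(feature_union):
    # FeatureUnion.get_feature_names breaks when its members are Pipelines,
    # so pull the names from each sub-pipeline's final transformer instead
    names = []
    for name, sub_pipe in feature_union.transformer_list:
        last = sub_pipe.steps[-1][1]    # e.g. the TfidfVectorizer
        names.extend(['%s__%s' % (name, f) for f in last.get_feature_names()])
    return np.array(names)

# vectorized branch: names from the inner union, then its percentile mask
vec_pipe = fitted['vectorized_pipeline']
vec_names = union_feature_names(vec_pipe.named_steps['union_vectorizer'])
vec_names = vec_names[vec_pipe.named_steps['percentile_feature_selection'].get_support()]

# custom branch: same idea (inner_scale does not add or drop columns)
cus_pipe = fitted['custom_pipeline']
cus_names = union_feature_names(cus_pipe.named_steps['custom_features'])
cus_names = cus_names[cus_pipe.named_steps['percentile_feature_selection'].get_support()]

# concatenate in FeatureUnion order, then apply the RFE mask
all_names = np.concatenate([vec_names, cus_names])
support = pipeline.named_steps['rfe_feature_selection'].support_
print(all_names[support])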
answered Oct 14 '22 by Kevin