I am using recursive feature elimination in my sklearn pipeline, the pipeline looks something like this:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)
pipeline = Pipeline([
('features', FeatureUnion([
('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)),
('custom_features', CustomFeatures())])),
('rfe_feature_selection', f5),
('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
How can I get the feature names of features selected by the RFE? RFE should select the best 500 features, but I really need to take a look at what features have been selected.
EDIT:
I have a complex Pipeline which consists of multiple pipelines and feature unions, percentile feature selection and at the end Recursive Feature Elimination:
fs = feature_selection.SelectPercentile(feature_selection.chi2, percentile=90)
fs_vect = feature_selection.SelectPercentile(feature_selection.chi2, percentile=80)
f5 = feature_selection.RFE(estimator=svc, n_features_to_select=600, step=3)
countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features=2000, analyzer=u'word', sublinear_tf=True, use_idf = True, min_df=2, max_df=0.85, lowercase = True)
countVecWord_tags = TfidfVectorizer(ngram_range=(1, 4), max_features= 1000, analyzer=u'word', min_df=2, max_df=0.85, sublinear_tf=True, use_idf = True, lowercase = False)
pipeline = Pipeline([
('union', FeatureUnion(
transformer_list=[
('vectorized_pipeline', Pipeline([
('union_vectorizer', FeatureUnion([
('stem_text', Pipeline([
('selector', ItemSelector(key='stem_text')),
('stem_tfidf', countVecWord)
])),
('pos_text', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_tfidf', countVecWord_tags)
])),
])),
('percentile_feature_selection', fs_vect)
])),
('custom_pipeline', Pipeline([
('custom_features', FeatureUnion([
('pos_cluster', Pipeline([
('selector', ItemSelector(key='pos_text')),
('pos_cluster_inner', pos_cluster)
])),
('stylistic_features', Pipeline([
('selector', ItemSelector(key='raw_text')),
('stylistic_features_inner', stylistic_features)
])),
])),
('percentile_feature_selection', fs),
('inner_scale', inner_scaler)
])),
],
# weight components in FeatureUnion
# n_jobs=6,
transformer_weights={
'vectorized_pipeline': 0.8, # 0.8,
'custom_pipeline': 1.0 # 1.0
},
)),
('rfe_feature_selection', f5),
('clf', classifier),
])
I'll try to explain the steps. The first Pipeline consists of vectorizers and is called "vectorized_pipeline", all of these have a function "get_feature_names". The second Pipeline consists of my own features, I have implemented them with fit, transform and get_feature_names functions as well. When I use the suggestion of @Kevin, I get an error that 'union' (which is the name of my top element in the pipeline) does not have get_feature_names function:
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['union'].get_feature_names()
print np.array(feature_names)[support]
Also, when I try to get feature names from individual FeatureUnions, like this:
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline_age.named_steps['union_vectorizer'].get_feature_names()
print np.array(feature_names)[support]
I get a key error:
feature_names = pipeline.named_steps['union_vectorizer'].get_feature_names()
KeyError: 'union_vectorizer'
RFE ranks features by the model's “coef” or “feature importances” attributes. It then recursively eliminates a minor number of features per loop, removing any existing dependencies and collinearities present in the model.
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
The SelectFromModel is a meta-estimator that determines the weight importance by comparing to the given threshold value. In this tutorial, we'll briefly learn how to select best features of regression data by using the SelectFromModel in Python.
You can access each step of the Pipeline
with the attribute named_steps
, here's an example on the iris dataset, that only selects 2
features, but the solution will scale.
from sklearn import datasets
from sklearn import feature_selection
from sklearn.svm import LinearSVC
iris = datasets.load_iris()
X = iris.data
y = iris.target
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=2, step=1)
pipeline = Pipeline([
('rfe_feature_selection', f5),
('clf', LinearSVC1)
])
pipeline.fit(X, y)
With named_steps
you can access the attributes and methods of the transform object in the pipeline. The RFE
attribute support_
(or the method get_support()
) will return a boolean mask of the selected features:
support = pipeline.named_steps['rfe_feature_selection'].support_
Now support
is an array, you can use that to efficiently extract the name of your selected features (columns). Make sure your feature names are in a numpy array
, not a python list.
import numpy as np
feature_names = np.array(iris.feature_names) # transformed list to array
feature_names[support]
array(['sepal width (cm)', 'petal width (cm)'],
dtype='|S17')
Per my comment above, here is your example with the CustomFeautures() function removed:
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn import feature_selection
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
import numpy as np
X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']
# classifier
LinearSVC1 = LinearSVC(tol=1e-4, C = 0.10000000000000001)
f5 = feature_selection.RFE(estimator=LinearSVC1, n_features_to_select=500, step=1)
pipeline = Pipeline([
('features', FeatureUnion([
('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features= 4000))])),
('rfe_feature_selection', f5),
('clf', LinearSVC1),
])
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)
support = pipeline.named_steps['rfe_feature_selection'].support_
feature_names = pipeline.named_steps['features'].get_feature_names()
np.array(feature_names)[support]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With