
scikit-learn feature ranking returns identical values

I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn returns the same ranking (1) for multiple variables. This can also be seen in their example code:

>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_ 
array([ True,  True,  True,  True,  True, False, False, False, False,
       False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])

Is there a way I can make scikit-learn also identify the relative importance between the top features?

I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?

asked Jun 04 '19 by pir

People also ask

How does RFE ranking work?

RFE ranks features using the model's coef_ or feature_importances_ attribute. It then recursively eliminates a small number of features per iteration, which helps remove dependencies and collinearity that may exist in the model.

What is n_features_in_?

n_features_in_ is the number of features seen during fit. The companion attribute feature_names_in_ is an ndarray of shape (n_features_in_,) holding the names of those features; it is defined only when X has feature names that are all strings (new in scikit-learn 1.0).
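
A minimal illustration on the same data as the question (most fitted scikit-learn estimators expose this attribute in recent versions):

from sklearn.datasets import make_friedman1
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)
print(model.n_features_in_)  # 10: the number of features seen during fit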

What is SelectKBest method?

The SelectKBest method selects the k features with the highest scores. By changing the score_func parameter, it can be applied to both classification and regression data. Selecting the best features is an important step when preparing a large dataset for training.
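
A minimal sketch, assuming a regression task and using f_regression as the scoring function:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import SelectKBest, f_regression

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Keep the 5 features with the highest univariate F-scores.
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)
print(selector.scores_)        # univariate score for each feature
print(selector.get_support())  # boolean mask of the selected features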

What is SelectFromModel?

SelectFromModel is a meta-estimator that selects features whose importance (taken from the fitted estimator's coef_ or feature_importances_ attribute) meets a given threshold value. It offers a quick way to pick the best features of, for example, regression data.
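
A minimal sketch, assuming a Lasso estimator and an arbitrary threshold of 0.5:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Keep features whose absolute Lasso coefficient exceeds the threshold.
sfm = SelectFromModel(Lasso(alpha=0.01), threshold=0.5)
sfm.fit(X, y)
print(sfm.get_support())  # boolean mask of features meeting the threshold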


1 Answer

The goal of RFECV is to select the optimal number of features, so it cross-validates over the number of features to keep. In your case it chose to keep 5 features. The model is then refit on the whole dataset, eliminating features until only those 5 remain. Since those 5 are never removed, RFE assigns them all rank 1 and never differentiates among them.

You could get a ranking for all features by just running RFE with n_features_to_select=1:

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
# Eliminate one feature per step until a single feature is left,
# so every feature ends up with a distinct rank.
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_

array([ 4,  3,  5,  1,  2, 10,  8,  7,  6,  9])

You might ask why the rankings computed during cross-validation are not kept. However, for each split in the cross-validation, the features might have been ranked differently, so RFECV would have to return 5 different rankings for you to compare. That's not its interface, though (but it would also be easy to accomplish by running RFE on each split and doing the cross-validation yourself, as sketched below).
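
A minimal sketch of that idea, which also gives a rough sense of the ranking's stability (the "confidence" asked about in the question); the fold setup here is an assumption for illustration:

import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Run a full RFE ranking on the training part of each fold.
rankings = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    selector = RFE(SVR(kernel="linear"), step=1, n_features_to_select=1)
    selector.fit(X[train_idx], y[train_idx])
    rankings.append(selector.ranking_)

rankings = np.array(rankings)
print(rankings)              # one complete ranking per fold
print(rankings.std(axis=0))  # per-feature spread across folds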

On a different note, this might not be the best way to compute the influence of the features; looking at the coefficients directly, or computing permutation importance, might be more informative.
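
For example, permutation importance (available in sklearn.inspection since version 0.22) shuffles one feature at a time and measures the resulting drop in score; repeating the shuffles yields a mean and a standard deviation per feature, i.e. a built-in notion of confidence. A minimal sketch, scored on the training data for brevity:

from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)

result = permutation_importance(model, X, y, n_repeats=10, random_state=0)
for i in range(X.shape[1]):
    print(f"feature {i}: {result.importances_mean[i]:.3f}"
          f" +/- {result.importances_std[i]:.3f}")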

answered by Andreas Mueller