I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn returns the same ranking (1) for multiple variables. This can also be seen in their example code:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True, True, True, True, True, False, False, False, False,
False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
Is there a way I can make scikit-learn also identify the relative importance between the top features?
I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?
RFE ranks features using the model's coef_ or feature_importances_ attribute. It then recursively eliminates a small number of features per iteration, which helps remove the dependencies and collinearity that may be present in the model.
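For the linear SVR in the example, RFE reads the fitted coef_ attribute. A minimal sketch of the signal it works from, showing what a single elimination round looks at (the print is just for illustration):
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear").fit(X, y)

# RFE scores each feature by the magnitude of its coefficient;
# coef_ has shape (1, n_features) for a linear SVR
scores = np.abs(est.coef_).ravel()
print("weakest feature this round:", scores.argmin())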
The goal of RFECV is to select the optimal number of features, so it cross-validates over the number of features selected. In your case, it chose to keep 5 features. The model is then refit on the whole data set, with RFE eliminating features until only those 5 remain. Since those 5 are never eliminated, RFE assigns them all the same rank of 1.
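You can inspect what RFECV settled on directly. Assuming a recent scikit-learn (1.0 or later, where cv_results_ replaced the older grid_scores_), a minimal sketch:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

print(selector.n_features_)                     # number of features kept (5 here)
print(selector.cv_results_["mean_test_score"])  # mean CV score per candidate subset size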
You could get a ranking for all features by just running RFE and eliminating all the way down to a single feature:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_
array([ 4, 3, 5, 1, 2, 10, 8, 7, 6, 9])
You might ask why the ranking computed during cross-validation is not kept, since it did rank all features. However, the features might have been ranked differently for each CV split, so RFECV would have to return 5 different rankings for you to compare. That's not the interface, though (but it would be easy to accomplish by running RFE yourself inside a cross-validation loop).
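A minimal sketch of that do-it-yourself approach, collecting one full ranking per fold and using the spread across folds as a rough confidence measure (the fold setup here is my own choice, not what RFECV does internally):
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

rankings = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    rfe = RFE(SVR(kernel="linear"), n_features_to_select=1)
    rfe.fit(X[train_idx], y[train_idx])
    rankings.append(rfe.ranking_)

rankings = np.array(rankings)
print(rankings.mean(axis=0))  # average rank of each feature across folds
print(rankings.std(axis=0))   # spread: higher means the ranking is less stable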
On a different note, this might not be the best way to measure the influence of the features; looking at the coefficients directly, or at permutation importance, might be more informative.
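For example, permutation_importance from sklearn.inspection (available since scikit-learn 0.22) reports a mean and a standard deviation per feature, which also speaks to the confidence part of your question; a minimal sketch:
from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)

# shuffle each feature 30 times and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=30, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")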