I am trying to understand how to read the grid_scores_ and ranking_ values in RFECV. Here is the main example from the documentation:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True,  True,  True,  True,  True, False, False, False, False,
       False], dtype=bool)
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
How am I supposed to read ranking_ and grid_scores_? Is a lower ranking value better (or vice versa)? The reason I ask is that I have noticed that the features with the highest ranking values typically have the highest scores in grid_scores_.
However, if something has ranking = 1, shouldn't that mean it was ranked as the best of the group? This is also what the documentation says:
"Selected (i.e., estimated best) features are assigned rank 1"
But now let's look at the following example using some real data:
> rfecv.grid_scores_[np.nonzero(rfecv.ranking_ == 1)[0]]
0.0
while the feature with the highest ranking value has the highest *score*:
> rfecv.grid_scores_[np.argmax(rfecv.ranking_)]
0.997
Note that in the example above, the features with ranking = 1 have the lowest score.
On this matter: in this figure in the documentation, the y-axis reads "number of misclassifications", but it is plotting grid_scores_, which used 'accuracy' (?) as the scoring function. Shouldn't the y label read "accuracy" (the higher the better) instead of "number of misclassifications" (the lower the better)?
Recursive feature elimination (RFE) is a feature selection method: given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), it selects features by recursively considering smaller and smaller sets of features, removing the weakest feature (or features) at each iteration until the specified number of features is reached.
The method is available via the RFE class in scikit-learn. RFE is a transformer, so it follows the familiar fit/transform pattern: the class is configured with the chosen algorithm via the "estimator" argument and the number of features to keep via the "n_features_to_select" argument. It is a popular algorithm due to its easily configurable nature and robust performance; as the name suggests, it removes features one (or a few) at a time, based on the weights assigned by the chosen model in each iteration.
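As a quick illustration of that fit/transform pattern, here is a minimal sketch reusing the Friedman #1 data from the question (the choice of SVR and of n_features_to_select=5 is just for illustration):

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Configure RFE with an estimator and the number of features to keep.
rfe = RFE(estimator=SVR(kernel="linear"), n_features_to_select=5, step=1)

# fit() runs the recursive elimination; transform() drops the eliminated columns.
X_reduced = rfe.fit_transform(X, y)
print(X_reduced.shape)  # (50, 5): only the 5 selected features remain
```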
You are correct that a low ranking value indicates a good feature, and that a high cross-validation score in the grid_scores_ attribute is also good; however, you are misinterpreting what the values in grid_scores_ mean. From the RFECV documentation:

grid_scores_ : array of shape [n_subsets_of_features]
    The cross-validation scores such that grid_scores_[i] corresponds to the CV score of the i-th subset of features.
Thus the grid_scores_ values don't correspond to particular features; they are the cross-validation scores for subsets of features. In the example, the subset with 5 features turns out to be the most informative, because the 5th value in grid_scores_ (the CV score for the SVR model incorporating the 5 most highly ranked features) is the largest.
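To make the indexing concrete, here is a sketch re-running the documentation example and comparing the lengths of the two attributes. (Note: newer scikit-learn releases have replaced grid_scores_ with cv_results_, so the sketch falls back to cv_results_["mean_test_score"] when grid_scores_ is absent.)

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

# One CV score per candidate subset size: entry i is the score obtained
# with the (i + 1) most highly ranked features.
scores = getattr(selector, "grid_scores_", None)
if scores is None:  # newer scikit-learn: grid_scores_ was replaced by cv_results_
    scores = selector.cv_results_["mean_test_score"]

print(len(scores))             # 10 subset sizes (1 feature up to all 10)
print(len(selector.ranking_))  # 10 features: ranking_ is per feature, not per subset
print(selector.n_features_)    # size of the best-scoring subset (5 in the docs' example)
```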
You should also note that since the scoring metric is not explicitly specified, the scorer used is the default for SVR, which is R^2, not accuracy (which is only meaningful for classifiers).
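If you want the metric to be explicit rather than implied by the estimator's default, you can pass it yourself via the scoring parameter; a sketch using scikit-learn's "r2" scorer string (which for SVR matches the default anyway):

```python
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# scoring="r2" makes the regression metric explicit; leaving scoring=None
# falls back to the estimator's default scorer, which for SVR is also R^2.
# Trying scoring="accuracy" here would fail: accuracy is a classification metric.
selector = RFECV(SVR(kernel="linear"), step=1, cv=5, scoring="r2").fit(X, y)
print(selector.n_features_)
```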