I'm using scikit-learn's RFECV class to perform feature selection. I'm interested in identifying the relative importance of a bunch of variables. However, scikit-learn returns the same ranking (1) for multiple variables. This can also be seen in their example code:
>>> from sklearn.datasets import make_friedman1
>>> from sklearn.feature_selection import RFECV
>>> from sklearn.svm import SVR
>>> X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
>>> estimator = SVR(kernel="linear")
>>> selector = RFECV(estimator, step=1, cv=5)
>>> selector = selector.fit(X, y)
>>> selector.support_
array([ True, True, True, True, True, False, False, False, False,
False])
>>> selector.ranking_
array([1, 1, 1, 1, 1, 6, 4, 3, 2, 5])
Is there a way I can make scikit-learn also identify the relative importance between the top features?
I'm happy to increase the number of trees or similar if that's needed. Related to this, is there a way to see the confidence of this ranking?
RFE ranks features using the model's coef_ or feature_importances_ attribute. It then recursively eliminates a small number of features per iteration, which helps remove the dependencies and collinearity that may be present in the model.
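For the linear SVR in the example, RFE reads the fitted coef_ attribute. A minimal sketch of the signal it works from, showing what a single elimination round looks at (the print is just for illustration):
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear").fit(X, y)

# RFE scores each feature by the magnitude of its coefficient;
# coef_ has shape (1, n_features) for a linear SVR
scores = np.abs(est.coef_).ravel()
print("weakest feature this round:", scores.argmin())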
The goal of RFECV is to select the optimal number of features, so it cross-validates over the number of features selected. In your case, it chose to keep 5 features. The model is then refit on the whole data set, with RFE eliminating features until only those 5 remain. Since those 5 are never eliminated, RFE assigns them all the same rank of 1.
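You can inspect what RFECV settled on directly. Assuming a recent scikit-learn (1.0 or later, where cv_results_ replaced the older grid_scores_), a minimal sketch:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFECV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
selector = RFECV(SVR(kernel="linear"), step=1, cv=5).fit(X, y)

print(selector.n_features_)                     # number of features kept (5 here)
print(selector.cv_results_["mean_test_score"])  # mean CV score per candidate subset size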
You could get a ranking for all features by just running RFE and eliminating all the way down to a single feature:
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
estimator = SVR(kernel="linear")
selector = RFE(estimator, step=1, n_features_to_select=1)
selector = selector.fit(X, y)
selector.ranking_
array([ 4, 3, 5, 1, 2, 10, 8, 7, 6, 9])
You might ask why the ranking computed during cross-validation is not kept, since it did rank all features. However, the features might have been ranked differently for each CV split, so RFECV would have to return 5 different rankings for you to compare. That's not the interface, though (but it would be easy to accomplish by running RFE yourself inside a cross-validation loop).
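A minimal sketch of that do-it-yourself approach, collecting one full ranking per fold and using the spread across folds as a rough confidence measure (the fold setup here is my own choice, not what RFECV does internally):
import numpy as np
from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.model_selection import KFold
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

rankings = []
for train_idx, _ in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    rfe = RFE(SVR(kernel="linear"), n_features_to_select=1)
    rfe.fit(X[train_idx], y[train_idx])
    rankings.append(rfe.ranking_)

rankings = np.array(rankings)
print(rankings.mean(axis=0))  # average rank of each feature across folds
print(rankings.std(axis=0))   # spread: higher means the ranking is less stable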
On a different note, this might not be the best way to measure the influence of the features; looking at the coefficients directly, or at permutation importance, might be more informative.
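For example, permutation_importance from sklearn.inspection (available since scikit-learn 0.22) reports a mean and a standard deviation per feature, which also speaks to the confidence part of your question; a minimal sketch:
from sklearn.datasets import make_friedman1
from sklearn.inspection import permutation_importance
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
model = SVR(kernel="linear").fit(X, y)

# shuffle each feature 30 times and measure the drop in score
result = permutation_importance(model, X, y, n_repeats=30, random_state=0)
for i in result.importances_mean.argsort()[::-1]:
    print(f"feature {i}: {result.importances_mean[i]:.3f} "
          f"+/- {result.importances_std[i]:.3f}")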