
Selecting a Specific Number of Features via Sklearn's RFECV (Recursive Feature Elimination with Cross-validation)

I'm wondering if it is possible for Sklearn's RFECV to select a fixed number of the most important features. For example, working on a dataset with 617 features, I have been trying to use RFECV to see which 5 of those features are the most significant. However, RFECV does not have the parameter 'n_features_to_select', unlike RFE (which confuses me). How should I deal with this?

toenails_sauce asked Jul 04 '18 21:07


People also ask

What is recursive feature elimination with cross-validation?

Recursive Feature Elimination with Cross-Validation (RFECV) selects the best subset of features for the supplied estimator. It removes between 0 and N features (where N is the total number of features) using recursive feature elimination, then picks the subset that achieves the best cross-validation score for the model.
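A minimal sketch of that behavior (the synthetic dataset, estimator, and CV settings below are illustrative assumptions, not part of the original question):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, of which 5 are actually informative.
X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# RFECV chooses the number of features itself via cross-validation;
# there is no parameter that fixes the count exactly.
selector = RFECV(LogisticRegression(max_iter=1000), step=1, cv=5)
selector.fit(X, y)

print(selector.n_features_)     # number of features RFECV deems optimal
print(selector.support_.sum())  # same count, via the boolean mask
```

Note that recent scikit-learn versions do accept a `min_features_to_select` argument on RFECV, but it only sets a lower bound; it does not fix the count the way RFE's `n_features_to_select` does.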

How do you select the number of features in RFE?

The RFE method is available via the RFE class in scikit-learn. RFE is a transform. To use it, first the class is configured with the chosen algorithm specified via the “estimator” argument and the number of features to select via the “n_features_to_select” argument.
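For contrast, RFE with a fixed feature count can be sketched as follows (the dataset and estimator are illustrative assumptions):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=20,
                           n_informative=5, random_state=0)

# RFE is a transform: configure the estimator and the exact number
# of features to keep, then fit_transform to get the reduced matrix.
rfe = RFE(estimator=LogisticRegression(max_iter=1000),
          n_features_to_select=5)
X_reduced = rfe.fit_transform(X, y)
print(X_reduced.shape)  # (200, 5)
```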

Which feature selection technique used in recursive approach?

One such technique offered by Sklearn is Recursive Feature Elimination (RFE). It reduces model complexity by removing features one by one until the optimal number of features is left. It is one of the most popular feature selection algorithms due to its flexibility and ease of use.


1 Answer

According to this Quora post:

The RFECV object helps to tune or find this n_features parameter using cross-validation. For every step where "step" number of features are eliminated, it calculates the score on the validation data. The number of features left at the step which gives the maximum score on the validation data, is considered to be "the best n_features" of your data.

In other words, RFECV determines the optimal number of features (n_features) automatically; you do not specify it up front.
The fitted RFECV object exposes a ranking_ attribute with the per-feature ranking, and a support_ boolean mask that selects the optimal features it found.
However, if you must select exactly the top n features from RFECV, you can use the ranking_ attribute:

optimal_features = X[:, selector.support_] # selector is a RFECV fitted object

n = 6 # to select top 6 features
feature_ranks = selector.ranking_  # selector is a RFECV fitted object
feature_ranks_with_idx = enumerate(feature_ranks)
sorted_ranks_with_idx = sorted(feature_ranks_with_idx, key=lambda x: x[1])
top_n_idx = [idx for idx, rnk in sorted_ranks_with_idx[:n]]

top_n_features = X[:, top_n_idx]  # all rows, top-n columns
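The enumerate/sorted loop above can be written more compactly with NumPy's argsort. The sketch below fits RFECV on a synthetic dataset purely for illustration (the dataset, estimator, and n = 6 are assumptions). One caveat in either version: ranking_ ties all selected features at rank 1, so the ordering among those top-ranked features is arbitrary.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=100, n_features=20,
                           n_informative=5, random_state=0)
selector = RFECV(LogisticRegression(max_iter=1000), cv=5).fit(X, y)

n = 6
# ranking_ assigns 1 to every selected feature and higher values to the
# rest; a stable argsort puts the best-ranked feature indices first,
# which is equivalent to the enumerate/sorted approach above.
top_n_idx = np.argsort(selector.ranking_, kind="stable")[:n]
top_n_features = X[:, top_n_idx]
print(top_n_features.shape)  # (100, 6)
```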

Reference: sklearn documentation, Quora post

shanmuga answered Oct 20 '22 09:10