Is there any built-in way of doing brute-force feature selection in scikit-learn, i.e. exhaustively evaluating all possible combinations of the input features and then finding the best subset? I am familiar with the "Recursive feature elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.
L1 regularization introduces sparsity in the model coefficients, and it can be used to perform feature selection by eliminating the features that are not important.
LASSO, short for Least Absolute Shrinkage and Selection Operator, is a statistical method whose main purpose is feature selection and regularization of data models.
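As a rough sketch of that idea (the dataset and alpha value are illustrative, not taken from the answers here), scikit-learn's SelectFromModel can wrap an L1-penalised Lasso and keep only the features whose coefficients survive the penalty:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit an L1-penalised model; coefficients of unimportant features shrink to zero
lasso = Lasso(alpha=0.1).fit(X, y)

# Keep only the features with non-zero coefficients
selector = SelectFromModel(lasso, prefit=True)
print("kept feature indices:", selector.get_support(indices=True))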
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
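For comparison, a minimal RFE sketch (the estimator and the target of 5 features are arbitrary choices for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the weakest feature until only 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print("selected feature indices:", rfe.get_support(indices=True))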
Decision trees, a typical embedded feature selection algorithm, are widely used in machine learning and data mining (Sun & Hu, 2017). The classic methods to construct decision trees are ID3, C4.5 and CART (Quinlan, 1979, Quinlan, 1986, Salzberg, 1994, Yeh, 1991).
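A minimal sketch of this embedded, tree-based style of selection (the dataset and threshold are illustrative; scikit-learn's trees are a CART-style implementation rather than ID3/C4.5):

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# The fitted tree ranks features by impurity-based importance
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(tree, prefit=True, threshold="mean")
print("kept feature indices:", selector.get_support(indices=True))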
Combining the answer of Fred Foo and the comments of nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) in the book "An Introduction to Statistical Learning with Applications in R".
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed; use model_selection


def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
    estimator must have fit and score methods.
    X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare the best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
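For example, the function could be called like this (the diabetes data and max_size=4 are purely illustrative; any DataFrame and any estimator with fit and score methods work):

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

subset, score, best_per_size, scores_per_size = best_subset(
    LinearRegression(), X, y, max_size=4, cv=5)
print("best subset (column indices):", subset, "CV score:", score)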
See notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
You might want to take a look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does support scikit-learn classifier and regressor objects.
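A minimal sketch of how it can be used (the dataset, estimator, and feature-size bounds are illustrative):

from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Evaluate every feature combination of size 1 to 4 with 5-fold CV
efs = ExhaustiveFeatureSelector(LogisticRegression(max_iter=1000),
                                min_features=1,
                                max_features=4,
                                scoring='accuracy',
                                cv=5)
efs = efs.fit(X, y)
print("best subset:", efs.best_idx_, "CV accuracy:", efs.best_score_)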