
Exhaustive feature selection in scikit-learn?

Tags:

scikit-learn

Is there any built-in way of doing brute-force feature selection in scikit-learn, i.e. exhaustively evaluating all possible combinations of the input features and then finding the best subset? I am familiar with the "Recursive feature elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.

asked Apr 09 '14 by Dov



2 Answers

Combining Fred Foo's answer and the comments by nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) in the book "An Introduction to Statistical Learning with Applications in R".

import numpy as np
from itertools import combinations
from sklearn.model_selection import cross_val_score

def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
       estimator must have fit and score methods.
       X must be a pandas DataFrame.'''

    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1) 
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k: # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
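
For reference, here is a minimal usage sketch. The toy DataFrame and target below are made up for illustration (the ISLR lab applies this to the Hitters data); any numeric pandas DataFrame and target will do:

import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical toy data: three candidate features, target built from two of them.
X = pd.DataFrame({'f1': [1, 2, 3, 4, 5, 6, 7, 8],
                  'f2': [2, 1, 4, 3, 6, 5, 8, 7],
                  'f3': [1, 1, 2, 2, 3, 3, 4, 4]})
y = 2 * X['f1'] + X['f3']

# Exhaustively evaluate all subsets of up to 3 features, then compare sizes with 2-fold CV.
subset, score, best_per_size, cv_scores = best_subset(LinearRegression(), X, y, max_size=3, cv=2)
print(subset, score)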

See notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection

answered Sep 22 '22 by Pedro Villanueva


You might want to take a look at MLxtend's Exhaustive Feature Selector. It is not built into scikit-learn (yet?), but it does support scikit-learn's classifier and regressor objects.
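
For illustration, a minimal sketch of how that selector is typically used with a scikit-learn estimator (parameter names such as min_features/max_features follow mlxtend's documented API; check the version you have installed):

from mlxtend.feature_selection import ExhaustiveFeatureSelector as EFS
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True)

# Score every feature subset of size 1 to 4 with 5-fold cross-validation.
efs = EFS(LogisticRegression(max_iter=1000),
          min_features=1,
          max_features=4,
          scoring='accuracy',
          cv=5)
efs = efs.fit(X, y)

print('Best subset (indices):', efs.best_idx_)
print('Best CV accuracy:', efs.best_score_)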

answered Sep 18 '22 by jorijnsmit