Is there any built-in way of doing brute-force feature selection in scikit-learn, i.e. exhaustively evaluating all possible combinations of the input features and then finding the best subset? I am familiar with the "Recursive feature elimination" class, but I am specifically interested in evaluating all possible combinations of the input features one after the other.
L1 regularization introduces sparsity in the model coefficients, and it can be used to perform feature selection by eliminating the features that are not important.
LASSO, short for Least Absolute Shrinkage and Selection Operator, is a statistical method whose main purpose is feature selection and regularization of data models.
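As a rough sketch of that idea (the dataset and alpha value are illustrative, not taken from the answers here), scikit-learn's SelectFromModel can wrap an L1-penalised Lasso and keep only the features whose coefficients survive the penalty:

from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso

X, y = load_diabetes(return_X_y=True)

# Fit an L1-penalised model; coefficients of unimportant features shrink to zero
lasso = Lasso(alpha=0.1).fit(X, y)

# Keep only the features with non-zero coefficients
selector = SelectFromModel(lasso, prefit=True)
print("kept feature indices:", selector.get_support(indices=True))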
Recursive feature elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached.
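For comparison, a minimal RFE sketch (the estimator and the target of 5 features are arbitrary choices for illustration):

from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Recursively drop the weakest feature until only 5 remain
rfe = RFE(estimator=LogisticRegression(max_iter=5000), n_features_to_select=5)
rfe.fit(X, y)
print("selected feature indices:", rfe.get_support(indices=True))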
Decision trees, a typical embedded feature selection algorithm, are widely used in machine learning and data mining (Sun & Hu, 2017). The classic methods to construct decision trees are ID3, C4.5 and CART (Quinlan, 1979, Quinlan, 1986, Salzberg, 1994, Yeh, 1991).
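A minimal sketch of this embedded, tree-based style of selection (the dataset and threshold are illustrative; scikit-learn's trees are a CART-style implementation rather than ID3/C4.5):

from sklearn.datasets import load_wine
from sklearn.feature_selection import SelectFromModel
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# The fitted tree ranks features by impurity-based importance
tree = DecisionTreeClassifier(random_state=0).fit(X, y)

# Keep only features whose importance exceeds the mean importance
selector = SelectFromModel(tree, prefit=True, threshold="mean")
print("kept feature indices:", selector.get_support(indices=True))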
Combining the answer of Fred Foo and the comments of nopper, ihadanny and jimijazz, the following code gets the same results as the R function regsubsets() (part of the leaps library) for the first example in Lab 1 (6.5.1 Best Subset Selection) in the book "An Introduction to Statistical Learning with Applications in R".
from itertools import combinations

import numpy as np
from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed; use model_selection


def best_subset(estimator, X, y, max_size=8, cv=5):
    '''Calculates the best model of up to max_size features of X.
    estimator must have fit and score methods.
    X must be a DataFrame.'''
    n_features = X.shape[1]
    subsets = (combinations(range(n_features), k + 1)
               for k in range(min(n_features, max_size)))

    best_size_subset = []
    for subsets_k in subsets:  # for each list of subsets of the same size
        best_score = -np.inf
        best_subset = None
        for subset in subsets_k:  # for each subset
            estimator.fit(X.iloc[:, list(subset)], y)
            # get the subset with the best score among subsets of the same size
            score = estimator.score(X.iloc[:, list(subset)], y)
            if score > best_score:
                best_score, best_subset = score, subset
        # to compare subsets of different sizes we must use CV
        # first store the best subset of each size
        best_size_subset.append(best_subset)

    # compare the best subsets of each size
    best_score = -np.inf
    best_subset = None
    list_scores = []
    for subset in best_size_subset:
        score = cross_val_score(estimator, X.iloc[:, list(subset)], y, cv=cv).mean()
        list_scores.append(score)
        if score > best_score:
            best_score, best_subset = score, subset

    return best_subset, best_score, best_size_subset, list_scores
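For example, the function could be called like this (the diabetes data and max_size=4 are purely illustrative; any DataFrame and any estimator with fit and score methods work):

import pandas as pd
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

data = load_diabetes()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target

subset, score, best_per_size, scores_per_size = best_subset(
    LinearRegression(), X, y, max_size=4, cv=5)
print("best subset (column indices):", subset, "CV score:", score)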
See notebook at http://nbviewer.jupyter.org/github/pedvide/ISLR_Python/blob/master/Chapter6_Linear_Model_Selection_and_Regularization.ipynb#6.5.1-Best-Subset-Selection
You might want to take a look at MLxtend's Exhaustive Feature Selector. It is obviously not built into scikit-learn (yet?), but it does support scikit-learn classifier and regressor objects.
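A minimal sketch of how it can be used (the dataset, estimator, and feature-size bounds are illustrative):

from mlxtend.feature_selection import ExhaustiveFeatureSelector
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

# Evaluate every feature combination of size 1 to 4 with 5-fold CV
efs = ExhaustiveFeatureSelector(LogisticRegression(max_iter=1000),
                                min_features=1,
                                max_features=4,
                                scoring='accuracy',
                                cv=5)
efs = efs.fit(X, y)
print("best subset:", efs.best_idx_, "CV accuracy:", efs.best_score_)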