I am trying to combine recursive feature elimination and grid search in scikit-learn. As you can see from the code below (which works), I am able to get the best estimator from a grid search and then pass that estimator to RFECV. However, I would rather do the RFECV first, then the grid search. The problem is that when I pass the selector from RFECV to the grid search, it does not take it:
ValueError: Invalid parameter bootstrap for estimator RFECV
Is it possible to get the selector from RFECV and pass it directly to RandomizedSearchCV, or is this procedurally not the right thing to do?
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)
grid = {"max_depth": [3, None],
"min_samples_split": sp_randint(1, 11),
"min_samples_leaf": sp_randint(1, 11),
"bootstrap": [True, False],
"criterion": ["gini", "entropy"]}
estimator = RandomForestClassifierCoef()
clf = RandomizedSearchCV(estimator, param_distributions=grid, cv=7)
clf.fit(X, y)
estimator = clf.best_estimator_
selector = RFECV(estimator, step=1, cv=4)
selector.fit(X, y)
selector.grid_scores_
param_grid – A dictionary with parameter names as keys and lists of parameter values. 3. scoring – The performance measure. For example, 'r2' for regression models, 'precision' for classification models.
Incidentally, scikit-learn is also called sklearn, so if you see the two terms, they mean the same thing. RFE can be used to handle problems presented by the two models listed below: Classification: Classification predicts the class of selected data points.
Recursive Feature Elimination(RFE) is the Wrapper method, i.e., it can ta. This algorithm fits a model and determines how significant features explain the variation in the dataset. Once the feature importance has been determined, it then removes those less important features one at a time in each iteration.
GridSearchCV is a function that comes in Scikit-learn's(or SK-learn) model_selection package.So an important point here to note is that we need to have the Scikit learn library installed on the computer. This function helps to loop through predefined hyperparameters and fit your estimator (model) on your training set.
The best way to do this would be to nest the RFECV inside the random search, using the method from this SO answer. Some example code, based on the question code and the SO answer mentioned above:
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.grid_search import GridSearchCV, RandomizedSearchCV
from sklearn.ensemble import RandomForestClassifier
from scipy.stats import randint as sp_randint
# Build a classification task using 5 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=5, n_redundant=2, n_repeated=0, n_classes=8, n_clusters_per_class=1, random_state=0)
grid = {"estimator__max_depth": [3, None],
"estimator__min_samples_split": sp_randint(1, 11),
"estimator__min_samples_leaf": sp_randint(1, 11),
"estimator__bootstrap": [True, False],
"estimator__criterion": ["gini", "entropy"]}
estimator = RandomForestClassifier()
selector = RFECV(estimator, step=1, cv=4)
clf = RandomizedSearchCV(selector, param_distributions=grid, cv=7)
clf.fit(X, y)
print(clf.grid_scores_)
print(clf.best_estimator_.n_features_)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With