I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.
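from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X, y) #n_features_ only exists after fitting
print("Optimal number of features : %d" % rfecv.n_features_)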
I am also performing GridSearchCV to tune the hyperparameters of the RandomForestClassifier, as follows.
X = df[[my_features]] #all my features
y = df['gold_standard'] #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
pred = CV_rfc.predict_proba(x_test)[:,1]
print(roc_auc_score(y_test, pred))
However, I am not clear how to merge feature selection (RFECV) with GridSearchCV.
When I run the answer suggested by @Gambit I got the following error:
ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None, oob_score=False,
random_state=42, verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I could resolve the above issue by prefixing the parameter names in param_grid with estimator__.
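As a minimal sketch of why the prefix is needed: the names GridSearchCV accepts are exactly the keys returned by get_params() on the estimator it wraps, which is the check the error message itself suggests. The small cv=3 setting here is only an assumption to keep the example light.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

clf = RandomForestClassifier(random_state=42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=3, scoring="roc_auc")

# Every key used in param_grid must appear in this list.
valid_names = sorted(rfecv.get_params().keys())

# Parameters of the wrapped forest are exposed under the 'estimator__' prefix.
forest_params = [name for name in valid_names if name.startswith("estimator__")]
print(forest_params)

# The bare name belongs to the forest, not to RFECV itself.
print("criterion" in valid_names, "estimator__criterion" in valid_names)
```

The second print shows that the unprefixed name is not a parameter of RFECV while the prefixed one is.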
My question now is: how do I use the selected features and parameters on x_test to verify that the model works well on unseen data? How can I obtain the best features and train the model on them with the optimal hyperparameters?
I am happy to provide more details if needed.
GridSearchCV tries every combination of the values passed in the parameter dictionary and evaluates the model for each combination using cross-validation. After fitting, we therefore have a cross-validated score (under the chosen scoring metric) for every combination of hyperparameters and can choose the one with the best performance.
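A small sketch of what that looks like in practice, using a synthetic dataset from make_classification and a deliberately tiny grid (both are assumptions chosen only to keep the run short):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data standing in for the real feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

grid = GridSearchCV(
    RandomForestClassifier(n_estimators=20, random_state=0),
    param_grid={"max_depth": [2, 4], "criterion": ["gini", "entropy"]},
    cv=3,
    scoring="roc_auc",
)
grid.fit(X, y)

# One entry per combination: the parameters tried and their mean CV score.
results = list(zip(grid.cv_results_["params"], grid.cv_results_["mean_test_score"]))
for params, mean_score in results:
    print(params, round(float(mean_score), 3))

# The combination with the best mean cross-validated score.
print(grid.best_params_, grid.best_score_)
```

With two values for each of two parameters, the grid contains four combinations, and best_params_ is the one with the highest mean score.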
Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (with cross-validation).
The Pipeline object is meant exactly for this purpose: assembling the data transformations and applying an estimator.
You could even use a different model (GradientBoostingClassifier, etc.) for your final classification. That is possible with the following approach:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

# Values marked "assumed" are illustrative placeholders for arguments
# that are missing from the snippet as posted.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y)

#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30)
rfecv = RFECV(estimator=clf_featr_sele,
              scoring = 'roc_auc')

#you can have different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10)
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},  # assumed grid
                      cv= 5, scoring = 'roc_auc')

pipeline = Pipeline([('feature_sele',rfecv),
                     ('clf_cv',CV_rfc)])  # 'clf_cv' is an assumed step name
pipeline.fit(X_train, y_train)
Now, you can apply this pipeline (Including feature selection) for test data.
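A hedged sketch of that last step, scoring a fitted pipeline on the held-out split and reading back what was selected. The step name 'clf_cv', the max_depth grid, and the reduced step=5 / cv=3 settings are assumptions made here to keep the example quick; they are not part of the answer above.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=42
)

# Feature selection step (drops 5 features per iteration to stay fast).
selector = RFECV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    step=5, cv=3, scoring="roc_auc",
)
# Hyperparameter search for the final classifier (assumed grid).
search = GridSearchCV(
    RandomForestClassifier(n_estimators=10, random_state=42),
    param_grid={"max_depth": [2, 3]}, cv=3, scoring="roc_auc",
)
pipeline = Pipeline([("feature_sele", selector), ("clf_cv", search)])
pipeline.fit(X_train, y_train)

# The pipeline applies the same feature mask to X_test before predicting.
test_scores = pipeline.predict_proba(X_test)[:, 1]
test_auc = roc_auc_score(y_test, test_scores)
print("Test ROC AUC:", test_auc)

# Boolean mask of kept columns and the hyperparameters chosen on them.
selected_mask = pipeline.named_steps["feature_sele"].support_
best_params = pipeline.named_steps["clf_cv"].best_params_
print("Selected features:", int(selected_mask.sum()), "of", selected_mask.size)
print("Best parameters:", best_params)
```

Note that in this arrangement the features are selected once on the training split, and the grid search then runs only on the selected columns.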
You can do what you want by prefixing the names of the parameters you want to pass to the estimator with 'estimator__':
X = df[[my_features]]
y = df['gold_standard']
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
X_train, X_test, y_train, y_test = train_test_split(X, y)
CV_rfc.fit(X_train, y_train)
Output on fake data I made:
{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='entropy', max_depth=6, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
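To connect this to the question about unseen data, here is a minimal sketch, assuming a synthetic dataset and a much smaller grid than the one above so that it runs quickly. After fitting, best_estimator_ is the RFECV refit on the whole training split with the winning parameters, so it carries both the selected features and the tuned forest.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic stand-in for df[my_features] and df['gold_standard'].
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

clf = RandomForestClassifier(n_estimators=20, random_state=0,
                             class_weight="balanced")
rfecv = RFECV(estimator=clf, step=2, cv=StratifiedKFold(3), scoring="roc_auc")
param_grid = {"estimator__max_depth": [4, 6]}  # reduced grid (assumption)
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

search = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv=k_fold,
                      scoring="roc_auc")
search.fit(X_train, y_train)

# The refit RFECV: feature mask plus a forest trained on the kept columns.
best_rfecv = search.best_estimator_
selected = best_rfecv.support_
print("Best parameters:", search.best_params_)
print("Number of selected features:", best_rfecv.n_features_)

# RFECV applies its own mask to X_test before delegating to the forest.
test_auc = roc_auc_score(y_test, best_rfecv.predict_proba(X_test)[:, 1])
print("Test ROC AUC:", test_auc)
```

With a DataFrame input, the selected column names could be read as X_train.columns[best_rfecv.support_].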
You just need to pass the recursive feature elimination estimator directly into the GridSearchCV object. Something like this should work:
X = df[my_features] #all my features
y = df['gold_standard'] #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
# Keys need the 'estimator__' prefix so they reach the forest wrapped by RFECV;
# without it GridSearchCV raises "Invalid parameter ... for estimator RFECV".
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
#------------- Just pass your RFECV object as estimator here directly --------#
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
