I am using recursive feature elimination with cross-validation (RFECV) as a feature selector for a RandomForestClassifier, as follows.
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV
from sklearn.model_selection import StratifiedKFold

X = df[my_features] #all my features
y = df['gold_standard'] #labels
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
rfecv.fit(X,y)
print("Optimal number of features : %d" % rfecv.n_features_)
features=list(X.columns[rfecv.support_])
I am also performing GridSearchCV as follows to tune the hyperparameters of the RandomForestClassifier.
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score

X = df[my_features] #all my features
y = df['gold_standard'] #labels
x_train, x_test, y_train, y_test = train_test_split(X, y, random_state=0)
rfc = RandomForestClassifier(random_state=42, class_weight = 'balanced')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfc, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)
pred = CV_rfc.predict_proba(x_test)[:,1]
print(roc_auc_score(y_test, pred))
However, I am not clear on how to merge the feature selection (RFECV) with GridSearchCV.
EDIT:
When I ran the answer suggested by @Gambit, I got the following error:
ValueError: Invalid parameter criterion for estimator RFECV(cv=StratifiedKFold(n_splits=10, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='gini', max_depth=None, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators='warn', n_jobs=None, oob_score=False,
random_state=42, verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
verbose=0). Check the list of available parameters with `estimator.get_params().keys()`.
I could resolve the above issue by using the estimator__ prefix in the param_grid parameter list.
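For example, the grid keys would look something like this (a minimal sketch of the prefixed names, using values from my original grid):

param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    # ... same estimator__ pattern for the other RandomForestClassifier parameters
}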
My question now is: how do I use the selected features and parameters on x_test to verify that the model works well on unseen data? How can I obtain the best features and train the model with the optimal hyperparameters?
I am happy to provide more details if needed.
Basically, you want to fine-tune the hyperparameters of your classifier (with cross-validation) after feature selection using recursive feature elimination (with cross-validation).
The Pipeline object is exactly meant for this purpose of assembling the data transformation and applying an estimator.
Maybe you could use a different model (GradientBoostingClassifier, etc.) for your final classification. It would be possible with the following approach:
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.33,
                                                    random_state=42)

#this is the classifier used for feature selection
clf_featr_sele = RandomForestClassifier(n_estimators=30,
                                        random_state=42,
                                        class_weight="balanced")
rfecv = RFECV(estimator=clf_featr_sele,
              step=1,
              cv=5,
              scoring='roc_auc')

#you can have a different classifier for your final classifier
clf = RandomForestClassifier(n_estimators=10,
                             random_state=42,
                             class_weight="balanced")
CV_rfc = GridSearchCV(clf,
                      param_grid={'max_depth': [2, 3]},
                      cv=5, scoring='roc_auc')

pipeline = Pipeline([('feature_sele', rfecv),
                     ('clf_cv', CV_rfc)])

pipeline.fit(X_train, y_train)
pipeline.predict(X_test)
Now you can apply this pipeline (including feature selection) to the test data.
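For example, here is a minimal sketch of how you could evaluate the fitted pipeline on the held-out split and inspect what was selected (this assumes the pipeline defined above, with the step names 'feature_sele' and 'clf_cv'):

from sklearn.metrics import roc_auc_score

# ROC AUC on unseen data: RFECV first reduces X_test to the selected
# features, then the tuned classifier predicts
y_score = pipeline.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, y_score))

# features kept by RFECV (boolean mask over the columns)
selector = pipeline.named_steps['feature_sele']
print(selector.n_features_)
print(selector.support_)

# best hyperparameters found by the inner grid search
print(pipeline.named_steps['clf_cv'].best_params_)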
You can do what you want by prefixing the names of the parameters you want to pass to the estimator with 'estimator__'.
X = df[my_features]
y = df['gold_standard']
clf = RandomForestClassifier(random_state=0, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(3), scoring='roc_auc')
param_grid = {
    'estimator__n_estimators': [200, 500],
    'estimator__max_features': ['auto', 'sqrt', 'log2'],
    'estimator__max_depth': [4, 5, 6, 7, 8],
    'estimator__criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
X_train, X_test, y_train, y_test = train_test_split(X, y)
CV_rfc.fit(X_train, y_train)
Output on fake data I made:
{'estimator__n_estimators': 200, 'estimator__max_depth': 6, 'estimator__criterion': 'entropy', 'estimator__max_features': 'auto'}
0.5653035605690997
RFECV(cv=StratifiedKFold(n_splits=3, random_state=None, shuffle=False),
estimator=RandomForestClassifier(bootstrap=True, class_weight='balanced',
criterion='entropy', max_depth=6, max_features='auto',
max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, min_samples_leaf=1,
min_samples_split=2, min_weight_fraction_leaf=0.0,
n_estimators=200, n_jobs=None, oob_score=False, random_state=0,
verbose=0, warm_start=False),
min_features_to_select=1, n_jobs=None, scoring='roc_auc', step=1,
verbose=0)
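Since the best_estimator_ here is an RFECV that has been refit on the training split with the best parameters, you can read the selected features off it and then score the held-out data, for example (a minimal sketch based on the objects above):

from sklearn.metrics import roc_auc_score

best_rfecv = CV_rfc.best_estimator_            # the refit RFECV
print(best_rfecv.n_features_)                  # optimal number of features
print(list(X.columns[best_rfecv.support_]))    # names of the selected features

# feature selection + tuned forest applied to unseen data
pred = CV_rfc.predict_proba(X_test)[:, 1]
print(roc_auc_score(y_test, pred))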
You just need to pass the recursive feature elimination estimator (RFECV) directly into the GridSearchCV object. Something like this should work:
X = df[my_features] #all my features
y = df['gold_standard'] #labels
clf = RandomForestClassifier(random_state = 42, class_weight="balanced")
rfecv = RFECV(estimator=clf, step=1, cv=StratifiedKFold(10), scoring='roc_auc')
param_grid = {
    'n_estimators': [200, 500],
    'max_features': ['auto', 'sqrt', 'log2'],
    'max_depth': [4, 5, 6, 7, 8],
    'criterion': ['gini', 'entropy']
}
k_fold = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
#------------- Just pass your RFECV object as estimator here directly --------#
CV_rfc = GridSearchCV(estimator=rfecv, param_grid=param_grid, cv= k_fold, scoring = 'roc_auc')
CV_rfc.fit(x_train, y_train)
print(CV_rfc.best_params_)
print(CV_rfc.best_score_)
print(CV_rfc.best_estimator_)