
Doing hyperparameter estimation for the estimator in each fold of Recursive Feature Elimination

I am using sklearn to carry out recursive feature elimination with cross-validation, using the RFECV class. RFE involves repeatedly training an estimator on the current set of features, then removing the least informative features, until converging on the optimal number of features.

To get the best performance from the estimator, I want to select the best hyperparameters for the estimator for each number of features (edited for clarity). The estimator is a linear SVM, so I am only looking into the C parameter.

Initially, my code was as follows. However, this just did one grid search for C at the beginning, and then used the same C for each iteration.

from sklearn.cross_validation import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn import svm, grid_search

def get_best_feats(data,labels,c_values):

    parameters = {'C':c_values}

    # svm1 is passed to clf, which grid-searches the best C
    svm1 = svm.SVC(kernel='linear')
    clf = grid_search.GridSearchCV(svm1, parameters, refit=True)
    clf.fit(data,labels)
    #print 'best C',clf.best_params_['C']

    # svm2 uses the optimal hyperparameters from svm1
    svm2 = svm.SVC(C=clf.best_params_['C'], kernel='linear')
    # svm2 is then passed to RFECV as the estimator for recursive feature elimination
    rfecv = RFECV(estimator=svm2, step=1, cv=StratifiedKFold(labels, 5))      
    rfecv.fit(data,labels)                                                     

    print "support:",rfecv.support_
    return data[:,rfecv.support_]

The documentation for RFECV gives the parameter "estimator_params : Parameters for the external estimator. Useful for doing grid searches when an RFE object is passed as an argument to, e.g., a sklearn.grid_search.GridSearchCV object."

Therefore I want to try to pass my object 'rfecv' to the grid search object, as follows:

def get_best_feats2(data,labels,c_values):

    parameters = {'C':c_values}
    svm1 = svm.SVC(kernel='linear')
    rfecv = RFECV(estimator=svm1, step=1, cv=StratifiedKFold(labels, 5), estimator_params=parameters)
    rfecv.fit(data, labels)

    print "Kept {} out of {} features".format((data[:,rfecv.support_]).shape[1], data.shape[1])


    print "support:",rfecv.support_
    return data[:,rfecv.support_]

X,y = get_heart_data()


c_values = [0.1,1.,10.]
get_best_feats2(X,y,c_values)

But this returns the error:

max_iter=self.max_iter, random_seed=random_seed)
File "libsvm.pyx", line 59, in sklearn.svm.libsvm.fit (sklearn/svm/libsvm.c:1674)
TypeError: a float is required

So my question is: how can I pass the rfe object to the grid search in order to do cross-validation for each iteration of recursive feature elimination?

Thanks

user3140106 asked Apr 09 '15 12:04



1 Answer

So you want to grid-search the C in the SVM for each number of features in the RFE? Or for each CV iteration in the RFECV? From your last sentence, I guess it is the former.

You can do RFE(GridSearchCV(SVC(), param_grid)) to achieve that, though I'm not sure that is actually a helpful thing to do.

I don't think the second is possible right now (though it should be soon). You could do GridSearchCV(RFECV(), param_grid={'estimator__C': Cs_to_try}), but that nests two sets of cross-validation inside each other.
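For reference, here is a minimal sketch of that nested variant, written against a recent scikit-learn where these classes live in sklearn.model_selection rather than the older sklearn.grid_search and sklearn.cross_validation modules. The synthetic data from make_classification, the fold counts and the C grid are assumptions for illustration only, not the asker's data:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.svm import SVC

# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X, y = make_classification(n_samples=80, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

Cs_to_try = [0.1, 1.0, 10.0]

# Inner loop: RFECV picks the number of features for a fixed C.
rfecv = RFECV(estimator=SVC(kernel="linear"), step=1,
              cv=StratifiedKFold(n_splits=3))

# Outer loop: grid search over the C of the SVC wrapped inside RFECV.
search = GridSearchCV(rfecv, param_grid={"estimator__C": Cs_to_try},
                      cv=StratifiedKFold(n_splits=3))
search.fit(X, y)

best_C = search.best_params_["estimator__C"]
support = search.best_estimator_.support_  # boolean mask of kept features
print("best C:", best_C)
print("kept features:", support.sum(), "of", X.shape[1])
```

This runs a full RFECV for every candidate C in every outer fold, so the cost grows quickly with the grid size and the number of features.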

Update: GridSearchCV has no coef_, so the first one fails. A simple fix:

class GridSearchWithCoef(GridSearchCV):
    @property
    def coef_(self):
        return self.best_estimator_.coef_

And then use that instead.
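Put together, usage could look roughly like the sketch below. It assumes a recent scikit-learn. The synthetic data, the C grid, cv=3 and n_features_to_select=3 are made up for illustration. Because RFE refits its estimator at every elimination step, the grid search is rerun on the remaining features each time, so C is re-tuned for each number of features:

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC


class GridSearchWithCoef(GridSearchCV):
    """GridSearchCV that exposes the coef_ of its refitted best estimator."""

    @property
    def coef_(self):
        return self.best_estimator_.coef_


# Synthetic stand-in data (assumption): 6 features, 3 of them informative.
X, y = make_classification(n_samples=80, n_features=6, n_informative=3,
                           n_redundant=0, random_state=0)

param_grid = {"C": [0.1, 1.0, 10.0]}
search = GridSearchWithCoef(SVC(kernel="linear"), param_grid, cv=3)

# RFE ranks features by the coef_ of the best SVC found at each step.
rfe = RFE(estimator=search, n_features_to_select=3, step=1)
rfe.fit(X, y)

print("support:", rfe.support_)
print("ranking:", rfe.ranking_)
print("C on the final feature set:", rfe.estimator_.best_params_["C"])
```

Note that this tunes C inside a plain RFE with a fixed target number of features. It does not by itself choose the number of features the way RFECV does.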

Andreas Mueller answered Oct 17 '22 06:10