Grid Search with Recursive Feature Elimination in scikit-learn pipeline returns an error

I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.

GridSearchCV and RFE with a "bare" estimator work fine:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

Putting the estimator in a pipeline, however, raises an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)

selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)

EDIT:

I have realised that I was not clear in describing the problem. Here is a clearer snippet:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)

# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)

As you can see, the only difference is that the estimator is wrapped in a pipeline. Pipeline, however, does not expose the "coef_" or "feature_importances_" attributes of its final step (see the sketch after the questions below). The questions are:

  1. Is there a nice way of dealing with this in scikit-learn?
  2. If not, is this behaviour desired for any reason?
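For illustration, here is a minimal sketch (reusing the toy data from above) of the mismatch: after fitting, the final step of the pipeline carries coef_, but the pipeline object itself does not, and the pipeline is what RFE inspects:

from sklearn.datasets import make_friedman1
from sklearn import pipeline
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

pipe = pipeline.make_pipeline(SVR(kernel="linear"))
pipe.fit(X, y)

# The fitted final step exposes the weights that RFE needs...
print(pipe.steps[-1][-1].coef_)

# ...but the pipeline object itself does not expose them, and that is what RFE checks.
print(hasattr(pipe, "coef_"))  # False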

EDIT2:

Updated, working snippet based on the answer provided by @Chris:

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR


class MyPipe(pipeline.Pipeline):

    def fit(self, X, y=None, **fit_params):
        """Calls last elements .coef_ method.
        Based on the sourcecode for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        ----------
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self


X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)

# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
asked Apr 17 '16 by hubi86


2 Answers

You have an issue with your use of Pipeline.

A pipeline works as follows:

The first object is applied to the data when you call .fit(X, y). If that object exposes a .transform() method, the transform is applied and its output is used as the input for the next stage.

A pipeline can have any valid model as its final object, but all previous ones MUST expose a .transform() method.

Just like a pipe: you feed in data, and each object in the pipeline takes the previous output and applies another transform to it.
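A minimal sketch of that contract (the step names here are illustrative, not from the question):

from sklearn import pipeline, preprocessing
from sklearn.svm import SVR

# Valid: every step before the last exposes .transform(); the final step may be any model.
ok_pipe = pipeline.Pipeline([
    ('scale', preprocessing.StandardScaler()),  # transformer
    ('model', SVR(kernel="linear")),            # final estimator, no .transform() needed
])

# Invalid: SVR exposes no .transform(), so it cannot sit before another step.
# pipeline.Pipeline([('model', SVR(kernel="linear")),
#                    ('scale', preprocessing.StandardScaler())])
# -> raises a TypeError when constructed or fitted, depending on the scikit-learn version.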

As we can see,

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform

RFE exposes a transform method, and so should be included in the pipeline itself. E.g.

some_sklearn_model = RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
pipe_params = [('std_scaler', std_scaler), ('RFE', selector), ('clf', est)]

Your attempt has a few issues. Firstly, you are trying to scale a slice of your data. Imagine I had the two partitions [1, 1] and [10, 10]: if I normalise each by its own partition mean, I lose the information that my second partition is significantly above the overall mean. You should scale at the start, not in the middle (see the sketch below).
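A minimal numeric sketch of that point (the arrays are illustrative):

import numpy as np
from sklearn.preprocessing import StandardScaler

part_a = np.array([[1.0], [1.0]])
part_b = np.array([[10.0], [10.0]])

# Scaling each partition on its own maps both to zero:
# the fact that part_b sits far above part_a is lost.
print(StandardScaler().fit_transform(part_a).ravel())  # [ 0.  0.]
print(StandardScaler().fit_transform(part_b).ravel())  # [ 0.  0.]

# Scaling the full data once keeps the separation between the partitions.
print(StandardScaler().fit_transform(np.vstack([part_a, part_b])).ravel())
# [-1. -1.  1.  1.]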

Secondly, SVR does not implement a transform method, so you cannot use it as a non-final element in a pipeline.

RFE takes in a model, fits it to the data, and then evaluates the weight of each feature.
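For reference, a minimal sketch of RFE on its own (reusing the question's toy data):

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

# RFE repeatedly fits the estimator and drops the weakest features,
# judged by the fitted model's coef_ (or feature_importances_).
selector = feature_selection.RFE(SVR(kernel="linear"))
selector.fit(X, y)

print(selector.support_)   # mask of the features that survived
print(selector.ranking_)   # 1 = selected; higher = eliminated earlier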

EDIT:

You can include this behaviour if you wish by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's .coef_ attribute and store it locally in our derived class under the correct name. I suggest you look into the source code on GitHub, as this is only a first start and more error checking etc. would probably be required. Sklearn uses a function decorator called @if_delegate_has_method, which would be a handy thing to add to ensure the method generalises. I have run this code to make sure it runs, but nothing more.

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

class myPipe(pipeline.Pipeline):

    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline, then expose the last step's .coef_ attribute.
        Based on the source code for decision_function(X).
        Link: https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(myPipe, self).fit(X, y, **fit_params)
        # Copy the final estimator's weights onto the pipeline itself,
        # so that RFE can find them under the name it expects.
        self.coef_ = self.steps[-1][-1].coef_
        return self

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]

pipe = myPipe(pipe_params)



selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)

print(clf.best_params_)

If anything is not clear, please ask.

answered by Chris


I think you constructed the pipeline in a slightly different way from what is shown in the pipeline documentation.

Are you looking for this?

from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR

X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)

est = SVR(kernel="linear")

std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection', selector), ('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)

param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print(clf.grid_scores_)

Also see this useful example of combining things in a pipeline. For the RFE object, I followed the official documentation to construct it with your SVR estimator, and then put the RFE object into the pipeline in the same way as you had done with the scaler and estimator objects.

answered by edesz