I am trying to chain Grid Search and Recursive Feature Elimination in a Pipeline using scikit-learn.
GridSearchCV and RFE with a "bare" estimator work fine:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
param_grid = dict(estimator__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
Putting the estimator in a pipeline, however, raises an error: RuntimeError: The classifier does not expose "coef_" or "feature_importances_" attributes
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
selector = feature_selection.RFE(pipe)
param_grid = dict(estimator__clf__C=[0.1, 1, 10])
clf = GridSearchCV(selector, param_grid=param_grid, cv=10)
clf.fit(X, y)
EDIT:
I have realised that I was not clear in describing the problem. Here is a clearer snippet:
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# This will work
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10]})
clf.fit(X, y)
# This will not work
est = pipeline.make_pipeline(SVR(kernel="linear"))
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10]})
clf.fit(X, y)
As you can see, the only difference is putting the estimator in a pipeline. The Pipeline, however, hides the "coef_" and "feature_importances_" attributes of its final step. The question is: how can I use a Pipeline as the estimator inside RFE?
EDIT2:
Updated, working snippet based on the answer provided by @Chris
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class MyPipe(pipeline.Pipeline):
    def fit(self, X, y=None, **fit_params):
        """Fit the pipeline, then expose the last step's coef_ attribute.

        Based on the source code for decision_function(X):
        https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(MyPipe, self).fit(X, y, **fit_params)
        self.coef_ = self.steps[-1][-1].coef_
        return self
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Without Pipeline
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
# With Pipeline
est = MyPipe([('svr', SVR(kernel="linear"))])
selector = feature_selection.RFE(est)
clf = GridSearchCV(selector, param_grid={'estimator__svr__C': [1, 10, 100]})
clf.fit(X, y)
print(clf.grid_scores_)
Recursive Feature Elimination (RFE) is a popular feature selection algorithm: it is easy to configure and use, and it is effective at selecting the features (columns) of a training dataset that are most relevant for predicting the target variable. RFE works by fitting the model repeatedly and, at each iteration, removing a small number of the weakest features, as ranked by the fitted model's coef_ or feature_importances_ attribute; this also helps reduce collinearity among the remaining features.
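For illustration, here is a minimal sketch of RFE used on its own, on the same make_friedman1 data as above (n_features_to_select=5 and step=1 are arbitrary choices):

from sklearn.datasets import make_friedman1
from sklearn.feature_selection import RFE
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# Rank features with a linear SVR, eliminating one feature per iteration
selector = RFE(SVR(kernel="linear"), n_features_to_select=5, step=1)
selector.fit(X, y)
print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # 1 = selected; larger = eliminated earlier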
You have an issue with your use of Pipeline.

A pipeline works as follows: the first object is fit to the data when you call .fit(X, y). If that object exposes a .transform() method, the transform is applied and its output is used as the input for the next stage.

A pipeline can have any valid model as its final object, but all previous ones MUST expose a .transform() method.

Just like a pipe: you feed in data, and each object in the pipeline applies another transform to the previous output.
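Conceptually, Pipeline.fit chains the steps roughly like this (a simplified sketch of the contract, not scikit-learn's actual implementation):

def pipeline_fit(steps, X, y):
    data = X
    for name, transformer in steps[:-1]:
        # every intermediate step must expose fit and transform
        data = transformer.fit(data, y).transform(data)
    # the final step only needs a fit method
    steps[-1][1].fit(data, y)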
As we can see from the documentation,
http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html#sklearn.feature_selection.RFE.fit_transform
RFE exposes a transform method, and so it should be included in the pipeline itself, e.g.:
some_sklearn_model = RandomForestClassifier()
selector = feature_selection.RFE(some_sklearn_model)
pipe_params = [('std_scaler', std_scaler), ('RFE', selector), ('clf', est)]
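For completeness, a runnable version of that fragment on the question's data might look like this (a sketch; I use SVR throughout, since make_friedman1 is a regression problem, and rely on RFE's default number of features to select):

from sklearn.datasets import make_friedman1
from sklearn import feature_selection, pipeline, preprocessing
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
# RFE sits inside the pipeline as a transform step; SVR is the final estimator
selector = feature_selection.RFE(SVR(kernel="linear"))
pipe = pipeline.Pipeline([('std_scaler', preprocessing.StandardScaler()),
                          ('RFE', selector),
                          ('clf', SVR(kernel="linear"))])
pipe.fit(X, y)
print(pipe.named_steps['RFE'].support_)  # features kept by the RFE step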
Your attempt has a few issues. Firstly, you are trying to scale a slice of your data. Imagine I had two partitions, [1, 1] and [10, 10]. If I normalize by the mean of each partition, I lose the information that my second partition is significantly above the mean. You should scale at the start, not in the middle, as the small demonstration below shows.
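To make that concrete (my own sketch; note that scikit-learn maps a zero-variance column to all zeros rather than dividing by zero):

import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1.], [1.], [10.], [10.]])
# Scaling the whole dataset preserves the partitions' relative positions
print(StandardScaler().fit_transform(X).ravel())      # [-1. -1.  1.  1.]
# Scaling each partition separately erases that information
print(StandardScaler().fit_transform(X[:2]).ravel())  # [0. 0.]
print(StandardScaler().fit_transform(X[2:]).ravel())  # [0. 0.]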
Secondly, SVR does not implement a transform method, so you cannot incorporate it as a non-final element of a pipeline.

RFE takes a model, fits it to the data, and then evaluates the weight of each feature.
EDIT:
You can include this behaviour if you wish, by wrapping the sklearn pipeline in your own class. What we want to do is, when we fit the data, retrieve the last estimator's coef_ attribute and store it locally in our derived class under the correct name.
I suggest you look at the source code on GitHub, as this is only a first start and more error checking etc. would probably be required. scikit-learn uses a function decorator called @if_delegate_has_method, which would be a handy thing to add to ensure the method generalises. I have run this code to make sure it runs, but nothing more.
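As a hypothetical alternative to copying coef_ inside fit() (my own sketch, not part of the original answer), the attribute could also be delegated lazily with a property:

from sklearn import pipeline

class CoefPipe(pipeline.Pipeline):
    # Delegate coef_ to the final step; this raises AttributeError (and so
    # fails hasattr checks) until the final estimator has been fitted.
    @property
    def coef_(self):
        return self.steps[-1][-1].coef_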
from sklearn.datasets import make_friedman1
from sklearn import feature_selection
from sklearn import preprocessing
from sklearn import pipeline
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
class myPipe(pipeline.Pipeline):
    def fit(self, X, y):
        """Fit the pipeline, then expose the last step's coef_ attribute.

        Based on the source code for decision_function(X):
        https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/pipeline.py
        """
        super(myPipe, self).fit(X, y)
        self.coef_ = self.steps[-1][-1].coef_
        return self
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
selector = feature_selection.RFE(est)
std_scaler = preprocessing.StandardScaler()
pipe_params = [('std_scaler', std_scaler),('select', selector), ('clf', est)]
pipe = myPipe(pipe_params)
selector = feature_selection.RFE(pipe)
clf = GridSearchCV(selector, param_grid={'estimator__clf__C': [2, 10]})
clf.fit(X, y)
print(clf.best_params_)

If anything is not clear, please ask.
I think you constructed the pipeline slightly differently from the way shown in the pipeline documentation.
Are you looking for this?
from sklearn.datasets import make_friedman1
from sklearn import feature_selection, pipeline, preprocessing
from sklearn.grid_search import GridSearchCV
from sklearn.svm import SVR
X, y = make_friedman1(n_samples=50, n_features=10, random_state=0)
est = SVR(kernel="linear")
std_scaler = preprocessing.StandardScaler()
selector = feature_selection.RFE(est)
pipe_params = [('feat_selection',selector),('std_scaler', std_scaler), ('clf', est)]
pipe = pipeline.Pipeline(pipe_params)
param_grid = dict(clf__C=[0.1, 1, 10])
clf = GridSearchCV(pipe, param_grid=param_grid, cv=2)
clf.fit(X, y)
print(clf.grid_scores_)
Also see this useful example for combining things in a pipeline. For the RFE object, I just used the official documentation for constructing it with your SVR estimator, and then put the RFE object into the pipeline in the same way as you had done with the scaler and estimator objects.