I'm trying to use XGBoost, and optimize the eval_metric as auc (as described here).
This works fine when using the classifier directly, but fails when I'm trying to use it as a pipeline.
What is the correct way to pass a .fit argument to the sklearn pipeline?
Example:
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_iris
from xgboost import XGBClassifier
import xgboost
import sklearn
print('sklearn version: %s' % sklearn.__version__)
print('xgboost version: %s' % xgboost.__version__)
X, y = load_iris(return_X_y=True)
# Without using the pipeline:
xgb = XGBClassifier()
xgb.fit(X, y, eval_metric='auc') # works fine
# Making a pipeline with this classifier and a scaler:
pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# using the pipeline, but not optimizing for 'auc':
pipe.fit(X, y) # works fine
# however this does not work (even after correcting the underscores):
pipe.fit(X, y, classifier__eval_metric='auc') # fails
The error:
TypeError: before_fit() got an unexpected keyword argument 'classifier__eval_metric'
Regarding the version of xgboost:
xgboost.__version__ shows 0.6
pip3 freeze | grep xgboost shows xgboost==0.6a2
XGBoost works well with Scikit-Learn, has a similar API, and can in most cases be used just like a Scikit-Learn model - so it's natural to be able to build pipelines with both libraries.
The XGBoost Python API provides a way to track how performance changes as trees are added. It takes two arguments: eval_set, usually the train and test sets, and the associated eval_metric, which measures the error on these evaluation sets.
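To make concrete what an AUC-style metric reports for each evaluation set, here is a minimal pure-Python sketch. It is not XGBoost's implementation: the function name pairwise_auc and the toy labels and scores are made up for illustration, and it only handles binary labels.

```python
def pairwise_auc(labels, scores):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical evaluation sets: name -> (true labels, predicted scores).
eval_sets = {
    "train": ([0, 0, 1, 1], [0.1, 0.2, 0.7, 0.9]),
    "test": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
}

for name, (labels, scores) in eval_sets.items():
    print(name, pairwise_auc(labels, scores))
# train 1.0
# test 0.75
```

The gap between the train and test values is the kind of signal you would watch while trees are added.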
The error occurs because you are using a single underscore between the estimator name and its parameter when fitting the pipeline. It should be two underscores.
From the documentation of Pipeline.fit(), the correct way of supplying params to fit is:
Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
So in your case, the correct usage is:
pipe.fit(X_train, y_train, classifier__eval_metric='auc')
(Notice two underscores between name and param)
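The s__p convention can be illustrated with a simplified sketch of how such keys might be grouped by step. This is not scikit-learn's actual code; route_fit_params is a hypothetical helper written only to show why the double underscore matters.

```python
from collections import defaultdict


def route_fit_params(step_names, fit_params):
    """Group 'step__param' keyword arguments by pipeline step name."""
    routed = defaultdict(dict)
    for key, value in fit_params.items():
        # Split only on the first double underscore: 'classifier__eval_metric'
        # becomes step 'classifier' and parameter 'eval_metric'.
        step, sep, param = key.partition("__")
        if not sep or step not in step_names:
            raise ValueError(f"cannot route fit parameter {key!r}")
        routed[step][param] = value
    return dict(routed)


steps = ["scaler", "classifier"]
print(route_fit_params(steps, {"classifier__eval_metric": "auc"}))
# {'classifier': {'eval_metric': 'auc'}}

# With a single underscore there is no step prefix to split on.
try:
    route_fit_params(steps, {"classifier_eval_metric": "auc"})
except ValueError as err:
    print(err)
```

With the single-underscore key the helper has no way to tell which step the argument belongs to, so it rejects it.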
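When the goal is to optimize, I suggest using the sklearn wrapper and GridSearchCV:
from xgboost.sklearn import XGBClassifier
# sklearn.grid_search was removed in scikit-learn 0.20; use sklearn.model_selection instead
from sklearn.model_selection import GridSearchCV
It looks like this: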
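pipe = Pipeline([('scaler', StandardScaler()), ('classifier', XGBClassifier())])
# 'roc_auc' expects a binary target; for a multiclass target such as iris, scikit-learn also offers 'roc_auc_ovr'
score = 'roc_auc'
pipe.fit(X, y)
param = {
# step name and parameter name are joined by a double underscore
'classifier__max_depth': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]  # just as an example
}
gsearch = GridSearchCV(estimator=pipe, param_grid=param, scoring=score)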
Fitting the search cross-validates every candidate in the grid:
gsearch.fit(X, y)
You then get the best params and the best score:
gsearch.best_params_, gsearch.best_score_
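To get a feel for how many candidates a grid produces, here is a rough standard-library sketch of expanding a parameter grid into individual settings. The helper expand_grid is hypothetical and not part of scikit-learn, and the n_estimators values are arbitrary choices for illustration.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed values as a dict of settings."""
    names = sorted(param_grid)
    value_lists = [param_grid[name] for name in names]
    return [dict(zip(names, combo)) for combo in product(*value_lists)]


# One parameter with ten values gives ten candidates.
param = {"classifier__max_depth": list(range(1, 11))}
print(len(expand_grid(param)))  # 10

# Adding a second parameter multiplies the number of candidates.
param["classifier__n_estimators"] = [50, 100]  # arbitrary example values
candidates = expand_grid(param)
print(len(candidates))  # 20
print(candidates[0])
# {'classifier__max_depth': 1, 'classifier__n_estimators': 50}
```

Each candidate is fitted once per cross-validation fold, so the total number of fits grows quickly as parameters are added to the grid.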