I'm wondering if I can do calibration in xgboost. To be more specific, does xgboost come with an existing calibration implementation like scikit-learn does, or is there a way to put a model from xgboost into scikit-learn's CalibratedClassifierCV?
As far as I know in sklearn this is the common procedure:
# Train random forest classifier, calibrate on validation data and evaluate
# on test data
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)          # uncalibrated probabilities

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)  # calibrated probabilities
sig_score = log_loss(y_test, sig_clf_probs)
print("Calibrated score is", sig_score)
If I put an xgboost tree model into CalibratedClassifierCV, an error is thrown (of course):
RuntimeError: classifier has no decision_function or predict_proba method.
Is there a way to integrate the excellent calibration module of scikit-learn with xgboost?
Appreciate your insightful ideas!
No; in my experience any further calibration by scikit-learn only distorts the probabilities XGBoost generates. Gradient boosted trees, although non-linear, already produce very well calibrated class probabilities, so both isotonic and sigmoid calibration tend to make the results worse rather than better.
When calibration does help, it improves measures that depend on the probability values themselves, such as the Brier score or the log loss, but it does not change ranking measures such as AUC.
For reference: a machine learning model is calibrated if it produces calibrated probabilities, and probabilities are calibrated when a prediction made with confidence p is correct roughly 100*p percent of the time.
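If you want to check this on your own data before deciding, scikit-learn's reliability-curve utility makes it easy to inspect the raw XGBoost probabilities. A rough sketch, assuming a binary problem and that X_train/y_train and X_test/y_test already exist (the names are just placeholders):

import xgboost as xgb
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Fit a plain XGBoost classifier
clf = xgb.XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Probability of the positive class on held-out data
probs = clf.predict_proba(X_test)[:, 1]

# Fraction of positives vs. mean predicted probability per bin;
# for a well-calibrated model the two track each other closely
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
print(list(zip(mean_pred, frac_pos)))
print("Brier score:", brier_score_loss(y_test, probs))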
A note from the hell scape that is July 2020:
You no longer need a wrapper class. A predict_proba method is built into the XGBoost scikit-learn Python API. I'm not sure exactly when it was added, but it is certainly there from v1.0.0 on.
Note: this of course only applies to estimators that would have a predict_proba method in the first place, e.g. XGBClassifier has it, XGBRegressor does not.
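So with any reasonably recent XGBoost, the snippet from the question works directly against XGBClassifier, no wrapper needed. A rough sketch along those lines, assuming the same X_train/X_valid/X_test splits as in the question:

import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# XGBClassifier exposes fit/predict_proba, so it drops straight into
# CalibratedClassifierCV with cv="prefit"
clf = xgb.XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)

sig_clf_probs = sig_clf.predict_proba(X_test)
print("Calibrated score is", log_loss(y_test, sig_clf_probs))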
Answering my own question: an xgboost GBT can be integrated with scikit-learn by writing a wrapper class like the one below.
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss


class XGBoostClassifier():
    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        # Note: 'multi:softprob' also requires 'num_class' to be passed in params
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map arbitrary labels to 0..n_classes-1 for xgboost
        self.label2num = dict((label, i) for i, label in enumerate(sorted(set(y))))
        # Expose classes_ so scikit-learn tools such as CalibratedClassifierCV can find it
        self.classes_ = np.array(sorted(set(y)))
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain,
                             num_boost_round=num_boost_round)
        return self

    def predict(self, X):
        num2label = dict((i, label) for label, i in self.label2num.items())
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        # For 'multi:softprob' the booster returns an (n_samples, n_classes) array
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Higher is better, hence the inverse of the log loss
        Y = self.predict_proba(X)
        return 1 / log_loss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self
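For illustration, here is roughly how the wrapper can be plugged into CalibratedClassifierCV, assuming train/validation/test splits as in the question (the hyperparameters are just placeholders; remember that 'multi:softprob' needs num_class):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# Train the wrapped booster, then calibrate it on the validation split
clf = XGBoostClassifier(num_boost_round=20, eta=0.1, max_depth=4,
                        num_class=len(set(y_train)))
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
print("Calibrated score is", log_loss(y_test, sig_clf.predict_proba(X_test)))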
See full example here.
Please don't hesitate to provide a smarter way of doing this!