
Calibration with xgboost

I'm wondering if I can do calibration in xgboost. More specifically, does xgboost come with an existing calibration implementation like scikit-learn's, or is there a way to put a model from xgboost into scikit-learn's CalibratedClassifierCV?

As far as I know, this is the common procedure in sklearn:

# Train a random forest classifier, calibrate on validation data and evaluate
# on test data
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)

# Fit a sigmoid (Platt) calibrator on the held-out validation set
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)
sig_score = log_loss(y_test, sig_clf_probs)
print("Calibrated score is", sig_score)

If I put an xgboost tree model into CalibratedClassifierCV, an error is (of course) thrown:

RuntimeError: classifier has no decision_function or predict_proba method.

Is there a way to integrate the excellent calibration module of scikit-learn with xgboost?

Appreciate your insightful ideas!

asked Feb 23 '16 by OrlandoL

People also ask

Does XGBoost need calibration?

Not usually. XGBoost trained with a log-loss objective already produces reasonably well-calibrated probabilities, and applying scikit-learn's isotonic or sigmoid calibration on top can distort them rather than improve them.

Is gradient boosting well calibrated?

Gradient boosting trees, which are non-linear, on the other hand tend to produce well-calibrated class probabilities.

Does calibration improve AUC?

No. Calibration improves measures that depend on the probability values themselves, such as the Brier score or log loss, but it does not affect ranking-based measures such as AUC.
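
As a quick sanity check of why AUC is unchanged: sigmoid and isotonic calibration are monotonic transforms of the scores, so the ranking is preserved. A small self-contained sketch with synthetic data (the labels and scores below are made up purely for illustration):

# Sketch: a monotonic transform of the scores (as sigmoid/isotonic calibration is)
# changes log loss but leaves the ranking, and hence AUC, unchanged.
import numpy as np
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=1000)
raw = np.clip(0.3 * y + rng.random(1000) * 0.7, 1e-6, 1 - 1e-6)   # "uncalibrated" scores
recal = 1 / (1 + np.exp(-(4 * raw - 2)))                          # a monotone "calibration" map

print(roc_auc_score(y, raw), roc_auc_score(y, recal))   # identical AUC
print(log_loss(y, raw), log_loss(y, recal))             # different log loss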

What is ML model calibration?

A machine learning model is calibrated if it produces calibrated probabilities. More specifically, probabilities are calibrated when a prediction of a class with confidence p is correct 100*p percent of the time.
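
To illustrate that definition, scikit-learn's calibration_curve bins the predicted probabilities and compares each bin's mean prediction with the observed frequency of the positive class; a well-calibrated model sits close to the diagonal. A minimal sketch (y_test and probs are placeholders for the true binary labels and the predicted positive-class probabilities):

from sklearn.calibration import calibration_curve

# Compare the mean predicted probability per bin with the observed positive rate.
prob_true, prob_pred = calibration_curve(y_test, probs, n_bins=10)
for p_hat, p_obs in zip(prob_pred, prob_true):
    print(f"predicted ~{p_hat:.2f} -> observed {p_obs:.2f}")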


2 Answers

A note from the hellscape that is July 2020:

You no longer need a wrapper class: the predict_proba method is built into the xgboost scikit-learn Python API. I'm not sure exactly when it was added, but it is certainly there from v1.0.0 onwards.

Note: this of course only applies to classes that should have a predict_proba method in the first place. For example, XGBRegressor doesn't; XGBClassifier does.
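
As an illustration, a minimal sketch of feeding XGBClassifier straight into CalibratedClassifierCV (assuming xgboost >= 1.0 and pre-split X_train/X_valid/X_test arrays as in the question; the hyperparameters are placeholders):

# Sketch: the sklearn-style XGBClassifier already exposes predict_proba,
# so it can be passed to CalibratedClassifierCV directly.
from xgboost import XGBClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = XGBClassifier(n_estimators=100)          # placeholder hyperparameters
clf.fit(X_train, y_train)

# cv="prefit" reuses the already-fitted model and only fits the calibration map
sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)

print("Calibrated log loss:", log_loss(y_test, sig_clf.predict_proba(X_test)))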

answered Sep 20 '22 by Robert Beatty


Answering my own question: an xgboost GBT can be integrated with scikit-learn by writing a wrapper class, as in the example below.

import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss


# Minimal sklearn-style wrapper around xgb.train so the booster exposes
# fit / predict / predict_proba and can be used by CalibratedClassifierCV.
class XGBoostClassifier():
    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map arbitrary labels to 0..n_classes-1 for xgboost
        self.label2num = dict((label, i) for i, label in enumerate(sorted(set(y))))
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain, num_boost_round=num_boost_round)

    def predict(self, X):
        num2label = dict((i, label) for label, i in self.label2num.items())
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Crude "higher is better" score based on inverse log loss
        Y = self.predict_proba(X)
        return 1 / log_loss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self

See full example here.
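
For reference, a hedged sketch of plugging such a wrapper into CalibratedClassifierCV (the data splits and the eta/max_depth/num_class values are illustrative; note that recent scikit-learn releases may also expect a fitted classes_ attribute on a prefit estimator, so the wrapper's fit may need to set one):

# Sketch: calibrate the wrapped xgboost model on held-out validation data.
from sklearn.calibration import CalibratedClassifierCV

clf = XGBoostClassifier(num_boost_round=100, eta=0.1, max_depth=6, num_class=3)
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
calibrated_probs = sig_clf.predict_proba(X_test)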

Please don't hesitate to provide a smarter way of doing this!

answered Sep 21 '22 by OrlandoL