I'm wondering if I can do calibration in xgboost. To be more specific, does xgboost come with an existing calibration implementation like scikit-learn does, or is there a way to put a model from xgboost into scikit-learn's CalibratedClassifierCV?
As far as I know in sklearn this is the common procedure:
# Train random forest classifier, calibrate on validation data and evaluate
# on test data
from sklearn.ensemble import RandomForestClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

clf = RandomForestClassifier(n_estimators=25)
clf.fit(X_train, y_train)
clf_probs = clf.predict_proba(X_test)          # uncalibrated probabilities

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
sig_clf_probs = sig_clf.predict_proba(X_test)  # calibrated probabilities
sig_score = log_loss(y_test, sig_clf_probs)
print("Calibrated score is", sig_score)
If I put an xgboost tree model into CalibratedClassifierCV, an error is thrown (of course):
RuntimeError: classifier has no decision_function or predict_proba method.
Is there a way to integrate the excellent calibration module of scikit-learn with xgboost?
Appreciate your insightful ideas!
No; in my experience any further calibration by scikit-learn only distorts the probabilities XGBoost generates. Gradient boosted trees, although non-linear, already produce very well calibrated class probabilities, so both isotonic and sigmoid calibration tend to make the results worse rather than better.
When calibration does help, it improves measures that depend on the probability values themselves, such as the Brier score or the log loss, but it does not change ranking measures such as AUC.
For reference: a machine learning model is calibrated if it produces calibrated probabilities, and probabilities are calibrated when a prediction made with confidence p is correct roughly 100*p percent of the time.
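If you want to check this on your own data before deciding, scikit-learn's reliability-curve utility makes it easy to inspect the raw XGBoost probabilities. A rough sketch, assuming a binary problem and that X_train/y_train and X_test/y_test already exist (the names are just placeholders):

import xgboost as xgb
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

# Fit a plain XGBoost classifier
clf = xgb.XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

# Probability of the positive class on held-out data
probs = clf.predict_proba(X_test)[:, 1]

# Fraction of positives vs. mean predicted probability per bin;
# for a well-calibrated model the two track each other closely
frac_pos, mean_pred = calibration_curve(y_test, probs, n_bins=10)
print(list(zip(mean_pred, frac_pos)))
print("Brier score:", brier_score_loss(y_test, probs))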
A note from the hell scape that is July 2020:
You no longer need a wrapper class. A predict_proba method is built into the XGBoost scikit-learn Python API. I'm not sure exactly when it was added, but it is certainly there from v1.0.0 on.
Note: this of course only applies to estimators that would have a predict_proba method in the first place, e.g. XGBClassifier has it, XGBRegressor does not.
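So with any reasonably recent XGBoost, the snippet from the question works directly against XGBClassifier, no wrapper needed. A rough sketch along those lines, assuming the same X_train/X_valid/X_test splits as in the question:

import xgboost as xgb
from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# XGBClassifier exposes fit/predict_proba, so it drops straight into
# CalibratedClassifierCV with cv="prefit"
clf = xgb.XGBClassifier(n_estimators=100)
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)

sig_clf_probs = sig_clf.predict_proba(X_test)
print("Calibrated score is", log_loss(y_test, sig_clf_probs))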
Answering my own question: an xgboost GBT can be integrated with scikit-learn by writing a wrapper class like the one below.
import numpy as np
import xgboost as xgb
from sklearn.metrics import log_loss


class XGBoostClassifier():
    def __init__(self, num_boost_round=10, **params):
        self.clf = None
        self.num_boost_round = num_boost_round
        self.params = params
        # Note: 'multi:softprob' also requires 'num_class' to be passed in params
        self.params.update({'objective': 'multi:softprob'})

    def fit(self, X, y, num_boost_round=None):
        num_boost_round = num_boost_round or self.num_boost_round
        # Map arbitrary labels to 0..n_classes-1 for xgboost
        self.label2num = dict((label, i) for i, label in enumerate(sorted(set(y))))
        # Expose classes_ so scikit-learn tools such as CalibratedClassifierCV can find it
        self.classes_ = np.array(sorted(set(y)))
        dtrain = xgb.DMatrix(X, label=[self.label2num[label] for label in y])
        self.clf = xgb.train(params=self.params, dtrain=dtrain,
                             num_boost_round=num_boost_round)
        return self

    def predict(self, X):
        num2label = dict((i, label) for label, i in self.label2num.items())
        Y = self.predict_proba(X)
        y = np.argmax(Y, axis=1)
        return np.array([num2label[i] for i in y])

    def predict_proba(self, X):
        dtest = xgb.DMatrix(X)
        # For 'multi:softprob' the booster returns an (n_samples, n_classes) array
        return self.clf.predict(dtest)

    def score(self, X, y):
        # Higher is better, hence the inverse of the log loss
        Y = self.predict_proba(X)
        return 1 / log_loss(y, Y)

    def get_params(self, deep=True):
        return self.params

    def set_params(self, **params):
        if 'num_boost_round' in params:
            self.num_boost_round = params.pop('num_boost_round')
        if 'objective' in params:
            del params['objective']
        self.params.update(params)
        return self
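For illustration, here is roughly how the wrapper can be plugged into CalibratedClassifierCV, assuming train/validation/test splits as in the question (the hyperparameters are just placeholders; remember that 'multi:softprob' needs num_class):

from sklearn.calibration import CalibratedClassifierCV
from sklearn.metrics import log_loss

# Train the wrapped booster, then calibrate it on the validation split
clf = XGBoostClassifier(num_boost_round=20, eta=0.1, max_depth=4,
                        num_class=len(set(y_train)))
clf.fit(X_train, y_train)

sig_clf = CalibratedClassifierCV(clf, method="sigmoid", cv="prefit")
sig_clf.fit(X_valid, y_valid)
print("Calibrated score is", log_loss(y_test, sig_clf.predict_proba(X_test)))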
See full example here.
Please don't hesitate to provide a smarter way of doing this!