 

How to adjust the probability threshold in the XGBoost classifier when using the Scikit-Learn API

I have a question about the xgboost classifier with the sklearn API. There seems to be a parameter that controls how high the predicted probability must be for a sample to be returned as True, but I can't find it.

Normally, xgb.predict returns booleans and xgb.predict_proba returns probabilities in the interval [0, 1]. The two must be related: there should be a probability threshold that decides each sample's class.

from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
import numpy as np

dtrain, dtest = train_test_split(data, test_size=0.1, random_state=22)

param_dict={'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 4,
 'min_child_weight': 6,
 'missing': None,
 'n_estimators': 1000,
 'objective': 'binary:logistic',
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1}

xgb = XGBClassifier(**param_dict, n_jobs=2)

xgb.fit(dtrain[features], dtrain['target'])

result_boolean = xgb.predict(dtest[features])
print(np.sum(result_boolean))
Output: 936

result_proba = xgb.predict_proba(dtest[features])
result_boolean2 = (result_proba[:, 1] > 0.5)
print(np.sum(result_boolean2))
Output: 936

It looks like the default probability threshold is 0.5, so the two result arrays contain the same number of True values. But I can't find where to adjust it. The signature is predict(data, output_margin=False, ntree_limit=None, validate_features=True), and none of those parameters looks like a threshold. I have also tested base_score, but it didn't affect the result.

The main reason I want to change the probability threshold is that I want to test XGBClassifier with different thresholds via GridSearchCV, and xgb.predict_proba doesn't seem to plug into GridSearchCV directly. How can I change the probability threshold in XGBClassifier?

asked Apr 10 '19 by 劉金喜




2 Answers

When you use ROC AUC (ROC = Receiver Operating Characteristic, AUC = Area Under the Curve) as the scoring function, the grid search is scored with predict_proba(). The chosen hyperparameters will be the ones that give the best overall performance across all possible decision thresholds.

GridSearchCV(scoring='roc_auc', ....)

Then you can plot the ROC curve to determine the decision threshold that gives you the desired trade-off between true positives and false positives (for a precision vs. recall trade-off, look at the precision-recall curve instead).
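For example, the two steps can be combined like this. This is a minimal sketch that reuses data, features, dtrain, dtest and the 'target' column from the question; the param_grid values and the final threshold of 0.3 are illustrative assumptions, not recommendations:

import matplotlib.pyplot as plt
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import roc_curve
from xgboost import XGBClassifier

# Step 1: tune hyperparameters with a threshold-independent metric.
grid = GridSearchCV(
    XGBClassifier(objective='binary:logistic'),
    param_grid={'max_depth': [3, 4, 5], 'min_child_weight': [1, 6]},  # illustrative values
    scoring='roc_auc',
    cv=5)
grid.fit(dtrain[features], dtrain['target'])

# Step 2: plot the ROC curve of the best model and read off a threshold.
proba = grid.best_estimator_.predict_proba(dtest[features])[:, 1]
fpr, tpr, thresholds = roc_curve(dtest['target'], proba)
plt.plot(fpr, tpr)
plt.xlabel('False positive rate')
plt.ylabel('True positive rate')
plt.show()

# Step 3: apply the chosen threshold yourself instead of the built-in 0.5.
chosen_threshold = 0.3  # example value read off the curve
result_boolean = proba > chosen_threshold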

[ROC curve: true-positive rate vs. false-positive rate across decision thresholds]

More info in the scikit-learn documentation on ROC.

answered Sep 21 '22 by Jon Nordby


I think you should look at the source code to understand this. I had trouble finding it for xgboost, but I found how it works in lightgbm, and I guess xgboost works similarly.

Go here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict) and look at the method "predict":

def predict(self, X, raw_score=False, num_iteration=None,
            pred_leaf=False, pred_contrib=False, **kwargs):
    """Docstring is inherited from the LGBMModel."""
    result = self.predict_proba(X, raw_score, num_iteration,
                                pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        class_index = np.argmax(result, axis=1)
        return self._le.inverse_transform(class_index)


predict.__doc__ = LGBMModel.predict.__doc__

In practice the classifier always goes through the multi-class code path, and predict chooses the class with the highest probability. The output of predict_proba is:

predicted_probability (array-like of shape = [n_samples, n_classes]) – The predicted probability for each class for each sample.

And you see that the method says:

class_index = np.argmax(result, axis=1)

Where "result" is the output of predict_proba. Now, if you have predict_proba for many classes do they sum to 1? I guess so, but I suppose we should go into the classifier loss function to really understand what is going on...

This is what I would read next: http://wiki.fast.ai/index.php/Log_Loss

answered Sep 24 '22 by spec3