I have a question about the XGBoost classifier with the sklearn API. It seems there should be a parameter that sets how high the predicted probability must be for a sample to be labelled True, but I can't find it.
Normally, xgb.predict would return booleans and xgb.predict_proba would return probabilities in the interval [0, 1]. I think the two results are related: there should be a probability threshold that decides each sample's class.
import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

dtrain, dtest = train_test_split(data, test_size=0.1, random_state=22)
param_dict = {'base_score': 0.5,
              'booster': 'gbtree',
              'colsample_bylevel': 1,
              'colsample_bytree': 1,
              'gamma': 0,
              'learning_rate': 0.1,
              'max_delta_step': 0,
              'max_depth': 4,
              'min_child_weight': 6,
              'missing': None,
              'n_estimators': 1000,
              'objective': 'binary:logistic',
              'reg_alpha': 0,
              'reg_lambda': 1,
              'scale_pos_weight': 1,
              'subsample': 1}
xgb = XGBClassifier(**param_dict, n_jobs=2)
xgb.fit(dtrain[features], dtrain['target'])
result_boolean = xgb.predict(dtest[features])
print(np.sum(result_boolean))
Output:936
result_proba = xgb.predict_proba(dtest[features])
result_boolean2= (result_proba[:,1] > 0.5)
print(np.sum(result_boolean2))
Output:936
It looks like the default probability threshold is 0.5, so both result arrays contain the same number of True values. But I can't find where to adjust it in the code.
predict(data, output_margin=False, ntree_limit=None, validate_features=True)
Also, I have tested base_score, but it didn't affect the result.
The main reason I want to change the probability threshold is that I want to test XGBClassifier with different probability thresholds using GridSearchCV. xgb.predict_proba doesn't seem to fit into GridSearchCV. How can I change the probability threshold in XGBClassifier?
All the most popular machine learning libraries in Python have a method called «predict_proba»: Scikit-learn (e.g. LogisticRegression, SVC, RandomForest, …), XGBoost, LightGBM, CatBoost, Keras… But, despite its name, «predict_proba» does not quite predict probabilities.
We can pick a cut-off from the decision function output and use it as the decision threshold: every sample whose decision score is below the threshold is assigned to the negative class (0), and every sample whose decision score is above it is assigned to the positive class ...
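As a minimal sketch of that idea, the snippet below applies a cut-off to an array of scores. The helper name apply_threshold and the example scores are made up for illustration; with a real model the scores would be something like predict_proba(X)[:, 1].

```python
import numpy as np


def apply_threshold(scores, threshold=0.5):
    """Label a sample positive (1) when its score is above the threshold."""
    return (np.asarray(scores) > threshold).astype(int)


# Hypothetical positive-class scores for five samples.
scores = np.array([0.10, 0.35, 0.50, 0.65, 0.90])

default_labels = apply_threshold(scores)       # cut-off 0.5
lenient_labels = apply_threshold(scores, 0.3)  # lower cut-off, more positives

print(default_labels)
print(lenient_labels)
```

Lowering the cut-off can only turn negatives into positives, never the reverse, which is why the threshold trades recall against precision.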
Tuning XGBoost's hyperparameters can improve the model's accuracy score. After initializing XGBoost, we train the model on the training set. The fitted trees are stored in the model object and are used when making predictions.
XGBoost is easy to use from scikit-learn. It is an ensemble of decision trees, so it often scores better than a single tree.
When you use ROC AUC (ROC = Receiver Operating Characteristic, AUC = Area Under Curve) as the scoring function, the grid search is scored on the output of predict_proba() rather than on hard class labels. The chosen hyperparameters will be the ones with the best overall performance across all possible decision thresholds.
GridSearchCV(scoring='roc_auc', ....)
Then you can plot the ROC curve to choose the decision threshold that gives you the desired trade-off between true-positive rate and false-positive rate (or, with a precision-recall curve, between precision and recall).
More info in scikit-learn documentation on ROC
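A rough sketch of that workflow, under some assumptions of mine: the data is synthetic, scikit-learn's GradientBoostingClassifier stands in for XGBClassifier (which should drop in the same way through its sklearn API), and the threshold is picked by maximising TPR minus FPR (Youden's J). That criterion is only one possible choice, not something the answer above prescribes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic, imbalanced binary data (a stand-in for the real dataset).
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.8, 0.2], random_state=0)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

# Step 1: tune hyperparameters with a threshold-independent metric.
search = GridSearchCV(
    GradientBoostingClassifier(n_estimators=30, random_state=0),
    param_grid={"max_depth": [2, 4]},
    scoring="roc_auc",
    cv=3,
)
search.fit(X_train, y_train)

# Step 2: pick a threshold on held-out data from the ROC curve.
proba = search.predict_proba(X_valid)[:, 1]
fpr, tpr, thresholds = roc_curve(y_valid, proba)

# thresholds[0] is a sentinel above every score (nothing predicted
# positive), so it is skipped when searching for the best cut-off.
best = 1 + np.argmax((tpr - fpr)[1:])
threshold = thresholds[best]

# Step 3: apply the chosen threshold instead of calling predict().
y_pred = (proba >= threshold).astype(int)
print(search.best_params_, threshold)
```

Choosing the threshold on data that was not used for fitting keeps the cut-off from being tuned to the training set.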
I think you should look at the source code to understand this. I had trouble finding it for xgboost, but I found how it works in lightgbm, and I guess that xgboost works similarly.
Go here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict) and look at the method "predict":
def predict(self, X, raw_score=False, num_iteration=None,
            pred_leaf=False, pred_contrib=False, **kwargs):
    """Docstring is inherited from the LGBMModel."""
    result = self.predict_proba(X, raw_score, num_iteration,
                                pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        class_index = np.argmax(result, axis=1)
        return self._le.inverse_transform(class_index)

predict.__doc__ = LGBMModel.predict.__doc__
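In practice the classifier always produces one probability per class, and predict picks the class with the highest probability. With two classes that is the same as labelling a sample positive when its positive-class probability is above 0.5, which is where the implicit 0.5 threshold comes from. The output of predict_proba is: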
predicted_probability (array-like of shape = [n_samples, n_classes]) – The predicted probability for each class for each sample.
And you see that the method says:
class_index = np.argmax(result, axis=1)
Here "result" is the output of predict_proba. Now, if you have predict_proba for many classes, do the probabilities sum to 1? I guess so, but I suppose we should look into the classifier's loss function to really understand what is going on...
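To make the argmax step concrete, here is a small NumPy-only check on a made-up two-column probability array shaped like a binary predict_proba output. It does not call lightgbm or xgboost. It only shows that an argmax over two columns that sum to 1 gives the same labels as a fixed 0.5 cut-off on the second column.

```python
import numpy as np

rng = np.random.default_rng(0)

# Fake binary predict_proba output: column 0 = P(class 0), column 1 = P(class 1).
p_pos = rng.random(1000)
proba = np.column_stack([1.0 - p_pos, p_pos])

# What predict() does according to the source above.
by_argmax = np.argmax(proba, axis=1)

# What the question did by hand.
by_threshold = (proba[:, 1] > 0.5).astype(int)

print(np.array_equal(by_argmax, by_threshold))
print(np.allclose(proba.sum(axis=1), 1.0))
```

So, at least for the binary case, there is no separate threshold parameter hidden in predict: the 0.5 falls out of the argmax.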
This is what I would read next: http://wiki.fast.ai/index.php/Log_Loss
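Since predict is just an argmax over predict_proba, one way to get what the question asks for is a thin wrapper that replaces that step with an explicit cut-off and exposes it as a hyperparameter. The sketch below is my own assumption, not part of xgboost or lightgbm: ThresholdClassifier is a hypothetical class, the data is synthetic, and GradientBoostingClassifier stands in for XGBClassifier, which should be usable as the estimator in the same way.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: binary classifier with a tunable cut-off."""

    def __init__(self, estimator=None, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # Fit a fresh copy so the wrapped estimator is left untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        # Explicit cut-off on the positive-class probability.
        positive = self.predict_proba(X)[:, 1] >= self.threshold
        return self.classes_[positive.astype(int)]


X, y = make_classification(n_samples=400, n_features=10,
                           weights=[0.8, 0.2], random_state=0)

base = GradientBoostingClassifier(n_estimators=20, random_state=0)
search = GridSearchCV(
    ThresholdClassifier(estimator=base),
    param_grid={"threshold": [0.3, 0.5, 0.7]},
    scoring="f1",  # a label-based metric, so the threshold matters
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

Note that this refits the underlying model for every threshold value, which is wasteful. Parameters of the wrapped model can be searched in the same grid with the estimator__ prefix, for example estimator__max_depth.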