How to adjust probability threshold in XGBoost classifier when using Scikit-Learn API

I have a question about the XGBoost classifier with the sklearn API. It seems there should be a parameter that sets how high the predicted probability must be for a sample to be classified as True, but I can't find it.

Normally, xgb.predict returns a boolean class label and xgb.predict_proba returns a probability in the interval [0, 1]. I think the two results are related: there should be a probability threshold that decides a sample's class.


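import numpy as np
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

# `data` is assumed to be a DataFrame and `features` a list of its column names
dtrain, dtest = train_test_split(data, test_size=0.1, random_state=22)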

param_dict={'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 4,
 'min_child_weight': 6,
 'missing': None,
 'n_estimators': 1000,
 'objective': 'binary:logistic',
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'subsample': 1}

xgb = XGBClassifier(**param_dict,n_jobs=2)

xgb.fit(dtrain[features], dtrain['target'])

result_boolean = xgb.predict(dtest[features])
print(np.sum(result_boolean))  # Output: 936

result_proba = xgb.predict_proba(dtest[features])
result_boolean2 = (result_proba[:, 1] > 0.5)
print(np.sum(result_boolean2))  # Output: 936

It looks like the default probability threshold is 0.5, since both result arrays contain the same number of True values. But I can't find where to adjust it. The signature of predict is predict(data, output_margin=False, ntree_limit=None, validate_features=True), which has no threshold argument. I have also tried changing base_score, but it didn't affect the result.

The main reason I want to change the probability threshold is that I want to test XGBClassifier with different thresholds using GridSearchCV. xgb.predict_proba doesn't seem to fit into GridSearchCV. How can I change the probability threshold in XGBClassifier?


asked Apr 10 '19 16:04

劉金喜

2 Answers

When you use ROC AUC (ROC = Receiver Operating Characteristic, AUC = Area Under Curve) as the scoring function, the grid search scores each candidate on the output of predict_proba() rather than on the hard labels from predict(). The chosen hyperparameters will be the ones with the best overall performance across all possible decision thresholds.

GridSearchCV(scoring='roc_auc', ....)
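As a rough illustration of the call above, here is a minimal sketch of a grid search scored by ROC AUC. The synthetic data, the small parameter grid and the use of scikit-learn's GradientBoostingClassifier as a stand-in (so the snippet runs without XGBoost installed) are all assumptions made for the example; an XGBClassifier should be usable in its place since it follows the same fit/predict_proba interface.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary data (assumption: stands in for the asker's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=22)

# Stand-in estimator; swap in XGBClassifier(...) if XGBoost is installed.
base = GradientBoostingClassifier(n_estimators=20, random_state=22)

# scoring="roc_auc" ranks candidates by their probability output,
# so no single decision threshold is involved in the model selection.
search = GridSearchCV(
    estimator=base,
    param_grid={"max_depth": [2, 4]},  # illustrative grid only
    scoring="roc_auc",
    cv=3,
)
search.fit(X, y)

print(search.best_params_, round(search.best_score_, 3))
```

The threshold itself is not tuned here; it is chosen afterwards from the fitted model's probabilities.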

Then you can plot the ROC curve to determine the decision threshold that gives you the desired trade-off between true-positive rate and false-positive rate (a precision-recall curve gives the analogous view of precision vs. recall).


More info in scikit-learn documentation on ROC
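The threshold selection described in this answer can be sketched without a plot by reading the candidate thresholds directly from roc_curve. This is only one possible rule: it picks the threshold that maximises TPR minus FPR (Youden's J statistic), which is an assumption made for the example, and it again uses synthetic data and a GradientBoostingClassifier as a stand-in for XGBClassifier.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_curve
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption: stands in for the asker's dataset).
X, y = make_classification(n_samples=500, n_features=8, random_state=22)
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.3, random_state=22
)

# Stand-in estimator; swap in XGBClassifier(...) if XGBoost is installed.
clf = GradientBoostingClassifier(n_estimators=30, random_state=22)
clf.fit(X_train, y_train)

# Probability of the positive class on held-out data.
proba = clf.predict_proba(X_val)[:, 1]

# One (fpr, tpr) point per candidate threshold.
fpr, tpr, thresholds = roc_curve(y_val, proba)

# Example rule: maximise TPR - FPR (Youden's J). Other rules are equally valid,
# e.g. the lowest threshold that keeps FPR under some limit.
best = np.argmax(tpr - fpr)
chosen_threshold = thresholds[best]

# Apply the chosen threshold instead of calling predict().
y_pred = (proba >= chosen_threshold).astype(int)
print(chosen_threshold, tpr[best], fpr[best])
```

Choosing the threshold on a validation split rather than on the final test set avoids an optimistic estimate of the resulting precision and recall.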

answered Sep 21 '22 14:09

Jon Nordby

I think you should look at the source code to understand this. I had trouble finding the relevant part for XGBoost, but I found how it works in LightGBM, and I guess that XGBoost works similarly.

Go here (https://lightgbm.readthedocs.io/en/latest/pythonapi/lightgbm.LGBMClassifier.html#lightgbm.LGBMClassifier.predict) and look at the method "predict":


def predict(self, X, raw_score=False, num_iteration=None,
            pred_leaf=False, pred_contrib=False, **kwargs):
    """Docstring is inherited from the LGBMModel."""
    result = self.predict_proba(X, raw_score, num_iteration,
                                pred_leaf, pred_contrib, **kwargs)
    if callable(self._objective) or raw_score or pred_leaf or pred_contrib:
        return result
    else:
        class_index = np.argmax(result, axis=1)
        return self._le.inverse_transform(class_index)


predict.__doc__ = LGBMModel.predict.__doc__

In practice the classifier is always treated as a multi-class classifier, and predict chooses the class with the highest probability. The output of predict_proba is:

predicted_probability (array-like of shape = [n_samples, n_classes]) – The predicted probability for each class for each sample.

And you see that the method says:


class_index = np.argmax(result, axis=1)

Here "result" is the output of predict_proba. Now, if you have predict_proba for many classes, do the probabilities sum to 1? I guess so, and in the binary case the two columns are then 1 - p and p, so taking the argmax is the same as thresholding p at 0.5. Still, I suppose we should look into the classifier's loss function to really understand what is going on...

This is what I would read next: http://wiki.fast.ai/index.php/Log_Loss
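To make the argmax step concrete, here is a small NumPy-only sketch. It assumes a binary predict_proba output whose two columns are 1 - p and p (the probabilities are randomly generated, not taken from a real model) and checks that the argmax agrees with a fixed 0.5 threshold, then applies a different threshold by hand.

```python
import numpy as np

# Fake positive-class probabilities (assumption: stands in for a real model).
rng = np.random.default_rng(0)
p_pos = rng.random(1000)

# Shape (n_samples, 2), columns = [P(class 0), P(class 1)].
proba = np.column_stack([1.0 - p_pos, p_pos])

# What an argmax-based predict() does.
by_argmax = np.argmax(proba, axis=1)

# The same decision written as an explicit 0.5 threshold.
by_threshold = (proba[:, 1] > 0.5).astype(int)
print(np.array_equal(by_argmax, by_threshold))

# A custom threshold has to be applied to predict_proba output manually.
custom = (proba[:, 1] > 0.3).astype(int)
print(by_threshold.sum(), custom.sum())
```

Lowering the threshold can only add predicted positives, so the second count is never smaller than the first.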
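Since predict only takes the argmax, one way to let GridSearchCV vary the threshold is to wrap the model in a small estimator whose threshold is an ordinary hyperparameter. The sketch below is an illustration under stated assumptions: ThresholdClassifier is a hypothetical helper written for this example (it is not part of XGBoost or scikit-learn), it only handles binary problems, the data is synthetic, and GradientBoostingClassifier stands in for XGBClassifier.

```python
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV


class ThresholdClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: binary predict() with a tunable threshold."""

    def __init__(self, estimator, threshold=0.5):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # Fit a fresh copy so the wrapped estimator passed in stays untouched.
        self.estimator_ = clone(self.estimator).fit(X, y)
        self.classes_ = self.estimator_.classes_
        return self

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def predict(self, X):
        # Column 1 holds the probability of classes_[1], the positive class.
        positive = self.predict_proba(X)[:, 1] > self.threshold
        return self.classes_[positive.astype(int)]


# Synthetic binary data (assumption: stands in for the asker's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=22)

# Stand-in estimator; swap in XGBClassifier(...) if XGBoost is installed.
base = GradientBoostingClassifier(n_estimators=20, random_state=22)

# The scorer must depend on predict(), e.g. "f1"; "roc_auc" would ignore
# the threshold because it is computed from predict_proba().
search = GridSearchCV(
    ThresholdClassifier(base),
    param_grid={"threshold": [0.3, 0.5, 0.7]},  # illustrative values
    scoring="f1",
    cv=3,
)
search.fit(X, y)
print(search.best_params_)
```

This refits the underlying model for every threshold value, which is wasteful for large models. Recent scikit-learn releases (1.5 and later, as far as I know) also provide FixedThresholdClassifier and TunedThresholdClassifierCV in sklearn.model_selection, which may be preferable to a hand-written wrapper.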

answered Sep 24 '22 14:09

spec3

Tags: python-3.x, scikit-learn, xgboost
