I'm trying to solve a multiclass classification problem using the xgboost algorithm, but I don't know exactly how predict_proba works. It generates a list of probabilities, but I don't know which class each probability corresponds to.
Here is a simple example:
This is my training data:
+------------+----------+-------+
| feature1 | feature2 | label |
+------------+----------+-------+
| x | z | 3 |
+------------+----------+-------+
| y | u | 0 |
+------------+----------+-------+
| x | u | 2 |
+------------+----------+-------+
Then, when I try to predict the probabilities for a new example:
model.predict_proba(['x','u'])
it returns something like this:
[0.2, 0.3, 0.5]
My question is: which class has the probability of 0.5? Is it class 2, class 3, or class 0?
model.predict_proba(): for classification problems, some estimators provide this method, which returns the probability that a new observation belongs to each class. The class with the highest probability is what model.predict() returns.
The predict method returns the actual predicted class, while predict_proba returns the class probabilities (i.e. the probability that a given data point falls into each of the underlying classes).
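A minimal sketch with dummy data (made up for illustration, not from the question) showing that relationship in the sklearn API of xgboost: predict() returns the class whose predict_proba() entry is highest, mapped through model.classes_.
import numpy as np
import xgboost as xgb
# dummy data: 100 examples, 5 features, 3 classes
X = np.random.random((100, 5))
y = np.random.randint(0, 3, 100)
clf = xgb.XGBClassifier(n_estimators=10, max_depth=2).fit(X, y)
proba = clf.predict_proba(X[:1])  # shape (1, 3): one probability per class
label = clf.predict(X[:1])        # the single predicted class
# predict() is the argmax of predict_proba(), mapped through classes_
assert label[0] == clf.classes_[np.argmax(proba[0])]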
Note that this behaviour depends on which API you use: with the native XGBoost API, the predictions made by Booster.predict() are probabilities by default, and in a binary classification problem each prediction is the probability that the input belongs to the positive class. The sklearn wrapper's model.predict() returns class labels instead.
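For contrast, a small sketch of the native (non-sklearn) API; the data and parameters here are invented for illustration:
import numpy as np
import xgboost as xgb
X = np.random.random((100, 5))
y = np.random.randint(0, 2, 100)  # binary labels
dtrain = xgb.DMatrix(X, label=y)
booster = xgb.train({"objective": "binary:logistic"}, dtrain, num_boost_round=10)
# each value is the probability of the positive class (label 1)
print(booster.predict(xgb.DMatrix(X[:3])))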
Also be careful with explanations written for other ensemble models: the description "number of votes for each class divided by the number of trees, so precision is exactly 1/n_estimators" applies to a random forest, not to XGBoost. XGBoost is a boosted model: predict_proba sums the leaf scores (raw margins) of all trees and passes them through a sigmoid (binary) or softmax (multiclass), so the probabilities are not quantised in steps of 1/n_estimators.
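If you want to verify this yourself, one way (assuming a reasonably recent xgboost, where predict() in the sklearn API accepts output_margin) is to check that a softmax over the raw margins reproduces predict_proba:
import numpy as np
import xgboost as xgb
from scipy.special import softmax
X = np.random.random((100, 5))
y = np.random.randint(0, 3, 100)
clf = xgb.XGBClassifier(n_estimators=10, max_depth=2).fit(X, y)
margins = clf.predict(X[:1], output_margin=True)  # raw per-class scores, shape (1, 3)
# softmax of the summed leaf scores matches the reported probabilities
print(np.allclose(softmax(margins, axis=1), clf.predict_proba(X[:1])))  # True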
It seems that you use the sklearn API of xgboost. In that case the model has a dedicated attribute, model.classes_, which returns the classes that were learned by the model; the order of the classes in that array corresponds to the order of the probabilities returned by predict_proba.
Here is an example with dummy data:
import numpy as np
import pandas as pd
import xgboost as xgb
# generate dummy data (10k examples, 10 numeric features, 4 classes of target)
np.random.seed(312)
train_X = np.random.random((10000, 10))
train_y_mcc = np.random.randint(0, 4, train_X.shape[0])  # four classes: 0, 1, 2, 3
# model
xgb_model_mpg = xgb.XGBClassifier(max_depth=3, n_estimators=100)
xgb_model_mpg.fit(train_X, train_y_mcc)
# classes
print(xgb_model_mpg.classes_)
>>> [0 1 2 3]
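To answer the original question directly, you can pair each probability with its class (continuing the example above):
# probabilities for one example, paired with the learned classes
probs = xgb_model_mpg.predict_proba(train_X[:1])[0]
for cls, p in zip(xgb_model_mpg.classes_, probs):
    print(f"class {cls}: {p:.3f}")
# the label returned by predict() is the class with the highest probability
print(xgb_model_mpg.classes_[np.argmax(probs)])
With the training labels from the question (0, 2, 3), model.classes_ would be [0 2 3], so the 0.5 at index 2 of your output corresponds to class 3.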