xgboost predict_proba : How to do the mapping between the probabilities and the labels

I'm trying to solve a multiclass classification problem using the xgboost algorithm, but I do not know exactly how predict_proba works. In fact, predict_proba generates a list of probabilities, but I don't know which class each probability relates to.

Here is a simple example:

This is my training data:

+------------+----------+-------+
| feature1   | feature2 | label |
+------------+----------+-------+
|    x       |    z     |   3   |
+------------+----------+-------+
|    y       |    u     |   0   |
+------------+----------+-------+
|    x       |    u     |   2   |
+------------+----------+-------+

Then when I try to predict probabilities for a new example:

model.predict_proba([['x', 'u']])

This will return something like this:

[0.2, 0.3, 0.5]

My question is: which class has the probability of 0.5? Is it class 2, 3, or 0?

asked Mar 29 '19 by ABK

People also ask

What does model.predict_proba() do in sklearn?

model.predict_proba(): For classification problems, some estimators also provide this method, which returns the probability that a new observation belongs to each categorical label. In this case, the label with the highest probability is what model.predict() returns.

What is the difference between predict_proba and predict?

The predict method is used to predict the actual class, while the predict_proba method can be used to infer the class probabilities (i.e. the probability that a particular data point falls into each of the underlying classes).
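
As a minimal sketch of that difference (the estimator and toy data here are our own, not from the thread), assuming any fitted sklearn-style classifier:

import numpy as np
from sklearn.linear_model import LogisticRegression

# toy data, invented for illustration: three classes 0, 1, 2
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 2]

clf = LogisticRegression().fit(X, y)

proba = clf.predict_proba([[1.5]])  # shape (1, n_classes), columns follow clf.classes_
label = clf.predict([[1.5]])[0]     # the single most likely class
assert label == clf.classes_[np.argmax(proba)]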

How do I get predictions on XGBoost?

To make predictions we use the scikit-learn function model.predict(). By default, the predictions made by XGBoost are probabilities. Because this is a binary classification problem, each prediction is the probability of the input pattern belonging to the first class.
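
With the sklearn wrapper used in this thread the behaviour differs slightly: predict() returns hard class labels, and the probabilities live in predict_proba() (raw probabilities come from the native Booster.predict() with a binary:logistic objective). A hedged sketch with invented data:

import numpy as np
import xgboost as xgb

rng = np.random.RandomState(0)
X = rng.random((100, 5))            # dummy features, invented for illustration
y = rng.randint(0, 2, 100)          # binary target: 0 or 1

model = xgb.XGBClassifier(n_estimators=10, max_depth=2)
model.fit(X, y)

print(model.predict(X[:3]))         # hard 0/1 labels
print(model.predict_proba(X[:3]))   # shape (3, 2): columns are P(class 0), P(class 1)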

How is predict_proba calculated?

predict_proba() returns the number of votes for each class, divided by the number of trees in the forest. Your precision is exactly 1/n_estimators. If you want to see variation at the 5th digit, you will need 10**5 = 100,000 estimators, which is excessive. You normally don't want more than 100 estimators.
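
Note that this quoted answer describes a random forest; xgboost's probabilities come from a logistic/softmax transform of summed tree outputs, not vote counting. A sketch of the vote-fraction behaviour with an assumed RandomForestClassifier:

import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.RandomState(0)
X = rng.random((200, 4))            # dummy data, invented for illustration
y = rng.randint(0, 3, 200)

rf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)
print(rf.predict_proba(X[:1]))      # fully grown trees cast pure 0/1 votes,
                                    # so each value here is a multiple of 1/10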


1 Answer

It seems that you are using the sklearn API of xgboost. In that case, the model has a dedicated attribute, model.classes_, that returns the classes the model has learned, and the order of the classes in that array corresponds to the order of the probabilities returned by predict_proba.

Here is an example with dummy data:

import numpy as np
import pandas as pd
import xgboost as xgb

# generate dummy data (10k examples, 10 numeric features, 4 classes of target)
np.random.seed(312)
train_X = np.random.random((10000,10))
train_y_mcc = np.random.randint(0, 4, train_X.shape[0])  # four classes: 0, 1, 2, 3

# model
xgb_model_mpg = xgb.XGBClassifier(max_depth=3, n_estimators=100)
xgb_model_mpg.fit(train_X, train_y_mcc)

# classes
print(xgb_model_mpg.classes_)
>>> [0 1 2 3]
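
To make the mapping explicit, a small follow-up sketch (reusing the variables above, not part of the original answer) pairs each predict_proba column with its class:

proba = xgb_model_mpg.predict_proba(train_X[:1])[0]  # probabilities for one example
for cls, p in zip(xgb_model_mpg.classes_, proba):    # column i belongs to classes_[i]
    print(cls, p)

Applied to the question: classes_ holds the sorted unique training labels, so with labels {0, 2, 3} it would come back as [0 2 3], and the 0.5 would belong to class 3 (recent xgboost versions instead require labels to already be encoded as 0..n_classes-1).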
answered Sep 28 '22 by Mischa Lisovyi