The label in my data is an (N by 1) vector. The label values are either 0 for negative samples or 1 for positive samples (so it's a binary classification problem). I used sklearn's .fit function to fit a random forest on my train set. To calculate the AUC for the test set I use metrics.roc_auc_score(test_labels, probabilities). I'm using predict_proba(my_test_set) to get the probabilities. However, predict_proba(my_test_set) returns an (N_test, 2) matrix. I have seen many people take the second column of this returned matrix (predict_proba(my_test_set)[:,1]) and feed it to metrics.roc_auc_score to calculate the AUC, but why the second one? Why not the first column (predict_proba(my_test_set)[:,0])?
ROC AUC is calculated by comparing the true label vector with the predicted probabilities of the positive class.
scikit-learn classifiers, including RandomForestClassifier, store the class labels in sorted order in classes_, and the columns of the predict_proba matrix follow that order. For binary labels 0 and 1, the probabilities of the class with the greater label (1) are therefore always in the second column. roc_auc_score makes the same assumption for binary targets: it treats the class with the greater label as the positive class and expects y_score to hold the scores of that class. Hence, both share the same definition of the positive class, and the second column is the one to pass.
This is why, for a binary problem, you should always do:
roc_auc_score(y_test, RFC.predict_proba(X_test)[:,1])
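To make the difference between the two columns concrete, here is a minimal sketch on synthetic data. The dataset, the split and the forest settings are assumptions chosen only for illustration; they are not from the question. Since the first column is the complement of the second, passing it should give roughly 1 minus the AUC rather than the AUC itself.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data with labels 0 and 1 (illustrative assumption).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

RFC = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Shape (n_test, 2); columns are ordered like RFC.classes_, i.e. [0, 1].
proba = RFC.predict_proba(X_test)

auc_pos = roc_auc_score(y_test, proba[:, 1])  # scores of class 1 (the positive class)
auc_neg = roc_auc_score(y_test, proba[:, 0])  # scores of class 0: the ranking is reversed

print(auc_pos, auc_neg, auc_pos + auc_neg)
```

If the first column is used by mistake, a good model shows up with an AUC well below 0.5, which is usually the hint that the wrong column was passed.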
roc_auc_score() expects y_true to be a binary indicator for the class and y_score to be the corresponding scores.
In your case, y_true is already the binary indicator for the positive class. To see which column holds the probability scores of which class, use clf.classes_. In our example, it would return array([0, 1]). Hence, we need to use the second column to get the probability scores for class 1.
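As a rough sketch of that lookup, the column can be found from clf.classes_ instead of hard-coding the index 1. The toy data and the names pos_col and pos_scores are hypothetical, introduced only for this illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Tiny illustrative dataset with labels 0 and 1 (an assumption, not the asker's data).
X = np.array([[0.0], [0.2], [0.4], [0.6], [0.8], [1.0]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# classes_ holds the labels in sorted order; predict_proba columns follow it.
print(clf.classes_)

# Find the column that belongs to class 1 rather than assuming its position.
pos_col = int(np.where(clf.classes_ == 1)[0][0])
pos_scores = clf.predict_proba(X)[:, pos_col]
```

Looking the column up this way also keeps working if the labels are something other than 0 and 1, for example strings.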
Even when you have a multi-class problem, convert your labels (y) into a binary indicator for the required class and then pick the corresponding column from the output of predict_proba() using clf.classes_.
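The multi-class recipe could look roughly like this. The iris dataset, the split and the choice of target_class are assumptions made for the sake of a runnable example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # three classes: 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y
)

clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

target_class = 2  # hypothetical choice of the class to treat as "positive"

# Column of predict_proba that corresponds to target_class.
col = int(np.where(clf.classes_ == target_class)[0][0])

# Binary indicator: 1 where the true label is target_class, 0 otherwise.
y_true_binary = (y_test == target_class).astype(int)

auc_target = roc_auc_score(y_true_binary, clf.predict_proba(X_test)[:, col])
print(auc_target)
```

This gives a one-vs-rest AUC for a single class; repeating it for each entry of clf.classes_ gives one AUC per class.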
Look at this example for more details.