 

How to calculate AUC for random forest model in sklearn?

The label in my data is an (N by 1) vector. The label values are either 0 for negative samples or 1 for positive samples (so it's a binary classification problem). I fit a random forest on my training set with sklearn's .fit method. To calculate AUC for the test set I use metrics.roc_auc_score(test_labels, probabilities), and I get the probabilities from predict_proba(my_test_set). However, predict_proba(my_test_set) returns an (N_test, 2) matrix. I have seen many people take the second column of this matrix (predict_proba(my_test_set)[:,1]) and feed it to metrics.roc_auc_score to calculate AUC, but why the second column? Why not the first one (predict_proba(my_test_set)[:,0])?

asked Jun 26 '19 by khemedi


2 Answers

ROC AUC is calculated by comparing the true label vector with the probability prediction vector of the positive class.

All scikit-learn classifiers, including RandomForestClassifier, order the columns of predict_proba by the sorted class labels (clf.classes_), so in a binary problem the class with the greater label, i.e. the positive class, always ends up in the second column. roc_auc_score makes the same assumption: it treats the class with the greater label as the positive class. Both therefore share the same definition of the positive class, and roc_auc_score expects to be given the probabilities from that second column.

This is why you should always do:

roc_auc_score(y_test, RFC.predict_proba(X_test)[:,1])
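
For reference, a minimal self-contained sketch of the whole workflow; the synthetic dataset and names such as RFC, X_test and y_test are illustrative, not the asker's actual setup:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Toy binary problem with labels 0/1, mirroring the question's setup
X, y = make_classification(n_samples=1000, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

RFC = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_train, y_train)

# predict_proba returns shape (n_samples, 2); column 1 holds P(class == 1),
# which is the positive-class score roc_auc_score expects
print(roc_auc_score(y_test, RFC.predict_proba(X_test)[:, 1]))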
answered Oct 08 '22 by MaximeKan


roc_auc_score() expects y_true to be a binary indicator for the class and y_score to be the corresponding scores.

In your case, y_true is already the binary indicator for the positive class. To see which column holds the probability scores for which class, use clf.classes_. In your example it would return array([0, 1]), so you need the second column to get the probability scores for class 1.
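
A short sketch of that check; the data and variable names here are illustrative, not the asker's:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

print(clf.classes_)                               # array([0, 1]): the column order of predict_proba
pos_col = int(np.where(clf.classes_ == 1)[0][0])  # index of the column holding P(y == 1)
print(roc_auc_score(y_test, clf.predict_proba(X_test)[:, pos_col]))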

Even when you have a multi-class problem, convert your labels (y) into a binary indicator for the required class and then pick the corresponding column from the output of predict_proba() using clf.classes_, as in the sketch below.
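
A hedged multi-class sketch of that idea; the iris data, target_class and the other names are illustrative choices, not from the question:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)   # three classes: 0, 1, 2
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)
clf = RandomForestClassifier(random_state=0).fit(X_train, y_train)

target_class = 2                                         # class we want an AUC for
col = int(np.where(clf.classes_ == target_class)[0][0])  # its column in predict_proba
y_true_bin = (y_test == target_class).astype(int)        # one-vs-rest binary indicator
print(roc_auc_score(y_true_bin, clf.predict_proba(X_test)[:, col]))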

Look at this example for more details.

answered Oct 07 '22 by Venkatachalam