I am trying to compute the area under the ROC curve with sklearn.metrics.roc_auc_score, called like this:
roc_auc = sklearn.metrics.roc_auc_score(actual, predicted)
where actual is a binary vector of ground-truth class labels and predicted is a binary vector of the labels my classifier predicted.
However, the roc_auc value I get is exactly equal to my accuracy (the proportion of samples whose labels are predicted correctly). This is not a one-off: I have run my classifier with various parameter settings and get the same result every time.
What am I doing wrong here?
This is because you are passing in the decisions of your classifier instead of the scores it calculated. There was a question about this on SO recently, and a related pull request to scikit-learn.
The point of a ROC curve (and the area under it) is to study the tradeoff between the true positive rate and the false positive rate as the classification threshold is varied. By default in a binary classification task, if your classifier's score is > 0.5, then class1 is predicted; otherwise class0 is predicted. As you change that threshold, you trace out the ROC curve. The higher the curve sits (the more area under it), the better the classifier. However, to get this curve you need access to the scores of a classifier, not its decisions. Otherwise, no matter where the decision threshold is placed, the decisions stay the same, and the AUC degenerates to accuracy.
Which classifier are you using?