
Comparing AUC, log loss and accuracy scores between models

I have the following evaluation metrics on the test set, after running 6 models for a binary classification problem:

  model   accuracy   logloss   AUC
  1       19%        0.45      0.54
  2       67%        0.62      0.67
  3       66%        0.63      0.68
  4       67%        0.62      0.66
  5       63%        0.61      0.66
  6       65%        0.68      0.42

I have the following questions:

  • How can model 1 be the best in terms of logloss (its logloss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?
  • How come model 6 has a lower AUC score than e.g. model 5, when model 6 has better accuracy? What does that mean?
  • Is there a way to say which of these 6 models is the best?
asked Oct 29 '19 by quant



1 Answer

Very briefly, with links (as parts of this have already been discussed elsewhere)...

How can model 1 be the best in terms of logloss (its logloss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?

Although the loss acts as a proxy for the accuracy (or vice versa), it is not a very reliable one in that respect. A closer look at the specific mechanics of accuracy versus loss may be useful here; consider the following SO threads (disclaimer: answers are mine):

  • Loss & accuracy - Are these reasonable learning curves?
  • How does Keras evaluate the accuracy? (despite the title, it is a general exposition, and not confined to Keras in particular)

To elaborate a little:

Assuming a sample with true label y=1, a probabilistic prediction from the classifier of p=0.51, and a decision threshold of 0.5 (i.e. for p>0.5 we classify as 1, otherwise as 0), the contribution of this sample to the accuracy is 1/n (i.e. positive), while the loss is

-log(p) = -log(0.51) = 0.6733446

Now, assume another sample again with true y=1, but now with a probabilistic prediction of p=0.99; the contribution to the accuracy will be the same, while the loss now will be:

-log(p) = -log(0.99) = 0.01005034

So, for two samples that are both correctly classified (i.e. they contribute positively to the accuracy by the exact same quantity), we have a rather huge difference in the corresponding losses...

Although what you present here seems rather extreme, it shouldn't be difficult to imagine a situation where many samples with true y=1 end up with predictions around p=0.49, hence giving a relatively low loss but nonetheless a zero contribution to the accuracy...
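
For illustration, here is a minimal sketch (plain numpy, not tied to any particular framework; the helper names are just for this example) that reproduces the per-sample numbers above, plus the near-threshold case just described:

import numpy as np

def sample_logloss(y_true, p):
    # per-sample binary cross-entropy for a single probabilistic prediction p
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def sample_accuracy(y_true, p, threshold=0.5):
    # per-sample accuracy contribution: 1 if correctly classified, 0 otherwise
    return int((p > threshold) == bool(y_true))

for p in (0.51, 0.99):
    print(f"p={p}: accuracy contribution = {sample_accuracy(1, p)}, "
          f"loss = {sample_logloss(1, p):.7f}")
# p=0.51: accuracy contribution = 1, loss = 0.6733446
# p=0.99: accuracy contribution = 1, loss = 0.0100503

# ...whereas a y=1 sample predicted at p=0.49 counts as wrong (0 for accuracy)
# but incurs only a moderate loss:
print(sample_accuracy(1, 0.49), sample_logloss(1, 0.49))   # 0 0.7133499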

How come model 6 has a lower AUC score than e.g. model 5, when model 6 has better accuracy? What does that mean?

This one is easier.

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other higher-is-better metric, such as accuracy, which may naturally lead to puzzles like the one you describe here.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds. So, the AUC does not actually measure the performance of a particular deployed model (which includes the chosen decision threshold), but the averaged performance of a family of models across all thresholds (the vast majority of which are of course of no interest to you, as they will never be used).
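
To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the toy scores below are made up purely for illustration) showing that the AUC is computed from the ranking of the scores alone, while the accuracy changes with the threshold you actually deploy:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.70, 0.90, 0.30, 0.60, 0.20])

# a single AUC summarizes the whole family of thresholded classifiers
print("AUC:", roc_auc_score(y_true, scores))   # 1.0 - the ranking is perfect

# ...but the accuracy of the model you actually deploy depends on the threshold
for t in (0.3, 0.5, 0.75):
    acc = accuracy_score(y_true, (scores >= t).astype(int))
    print(f"threshold={t}: accuracy={acc:.2f}")
# threshold=0.3: accuracy=0.70
# threshold=0.5: accuracy=1.00
# threshold=0.75: accuracy=0.70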

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system

Emphasis mine - see also On the dangers of AUC...

Simple advice: don't use it.

Is there a way to say which of these 6 models is the best?

Depends on the exact definition of "best"; if "best" means "best for my own business problem that I am trying to solve" (not an irrational definition for an ML practitioner), then it is the one that performs best according to the business metric that you have defined yourself as appropriate for your problem. This can never be the AUC, and normally it is not the loss either...

answered Sep 22 '22 by desertnaut