
Comparing AUC, log loss and accuracy scores between models

I have the following evaluation metrics on the test set, after running 6 models for a binary classification problem:

  model   accuracy   logloss   AUC
  1       19%        0.45      0.54
  2       67%        0.62      0.67
  3       66%        0.63      0.68
  4       67%        0.62      0.66
  5       63%        0.61      0.66
  6       65%        0.68      0.42

I have the following questions:

  • How can model 1 be the best in terms of logloss (its logloss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?
  • How come model 6 has a lower AUC score than e.g. model 5, when model 6 has better accuracy? What does that mean?
  • Is there a way to say which of these 6 models is the best?
asked Oct 29 '19 by quant



1 Answer

Very briefly, with links (as parts of this have already been discussed elsewhere)...

How can model 1 be the best in terms of logloss (its logloss is the closest to 0) when it performs the worst in terms of accuracy? What does that mean?

Although the loss acts as a proxy for the accuracy (or vice versa), it is not a very reliable one in that respect. A closer look at the specific mechanics of accuracy versus loss may be useful here; consider the following SO threads (disclaimer: answers are mine):

  • Loss & accuracy - Are these reasonable learning curves?
  • How does Keras evaluate the accuracy? (despite the title, it is a general exposition, and not confined to Keras in particular)

To elaborate a little:

Assuming a sample with true label y=1, a probabilistic prediction from the classifier of p=0.51, and a decision threshold of 0.5 (i.e. for p>0.5 we classify as 1, otherwise as 0), the contribution of this sample to the accuracy is 1/n (i.e. positive), while the loss is

-log(p) = -log(0.51) = 0.6733446

Now, assume another sample again with true y=1, but now with a probabilistic prediction of p=0.99; the contribution to the accuracy will be the same, while the loss now will be:

-log(p) = -log(0.99) = 0.01005034

So, for two samples that are both correctly classified (i.e. they contribute positively to the accuracy by the exact same quantity), we have a rather huge difference in the corresponding losses...

Although what you present here seems rather extreme, it shouldn't be difficult to imagine a situation where many samples with true y=1 end up with predictions around p=0.49, hence giving a relatively low loss but nonetheless a zero contribution to the accuracy...
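
For illustration, here is a minimal sketch (plain numpy, not tied to any particular framework; the helper names are just for this example) that reproduces the per-sample numbers above, plus the near-threshold case just described:

import numpy as np

def sample_logloss(y_true, p):
    # per-sample binary cross-entropy for a single probabilistic prediction p
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def sample_accuracy(y_true, p, threshold=0.5):
    # per-sample accuracy contribution: 1 if correctly classified, 0 otherwise
    return int((p > threshold) == bool(y_true))

for p in (0.51, 0.99):
    print(f"p={p}: accuracy contribution = {sample_accuracy(1, p)}, "
          f"loss = {sample_logloss(1, p):.7f}")
# p=0.51: accuracy contribution = 1, loss = 0.6733446
# p=0.99: accuracy contribution = 1, loss = 0.0100503

# ...whereas a y=1 sample predicted at p=0.49 counts as wrong (0 for accuracy)
# but incurs only a moderate loss:
print(sample_accuracy(1, 0.49), sample_logloss(1, 0.49))   # 0 0.7133499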

How come model 6 has a lower AUC score than e.g. model 5, when model 6 has better accuracy? What does that mean?

This one is easier.

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other higher-is-better metric, such as accuracy, which may naturally lead to puzzles like the one you describe here.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds. So, the AUC does not actually measure the performance of a particular deployed model (which includes the chosen decision threshold), but the averaged performance of a family of models across all thresholds (the vast majority of which are of course of no interest to you, as they will never be used).
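
To make this concrete, here is a minimal sketch (assuming scikit-learn is available; the toy scores below are made up purely for illustration) showing that the AUC is computed from the ranking of the scores alone, while the accuracy changes with the threshold you actually deploy:

import numpy as np
from sklearn.metrics import roc_auc_score, accuracy_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0, 1, 0])
scores = np.array([0.10, 0.40, 0.35, 0.80, 0.65, 0.70, 0.90, 0.30, 0.60, 0.20])

# a single AUC summarizes the whole family of thresholded classifiers
print("AUC:", roc_auc_score(y_true, scores))   # 1.0 - the ranking is perfect

# ...but the accuracy of the model you actually deploy depends on the threshold
for t in (0.3, 0.5, 0.75):
    acc = accuracy_score(y_true, (scores >= t).astype(int))
    print(f"threshold={t}: accuracy={acc:.2f}")
# threshold=0.3: accuracy=0.70
# threshold=0.5: accuracy=1.00
# threshold=0.75: accuracy=0.70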

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system

Emphasis mine - see also On the dangers of AUC...

Simple advice: don't use it.

Is there a way to say which of these 6 models is the best?

Depends on the exact definition of "best"; if "best" means "best for my own business problem that I am trying to solve" (not an irrational definition for an ML practitioner), then it is the one that performs best according to the business metric that you have defined yourself as appropriate for your problem. This can never be the AUC, and normally it is not the loss either...

answered Sep 22 '22 by desertnaut