
Getting a low ROC AUC score but a high accuracy

I'm using the LogisticRegression class from scikit-learn on a version of the flight delay dataset.

I use pandas to select some columns:

df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]

I fill in NaN values with 0:

df = df.fillna({'ARR_DEL15': 0})

I make sure the categorical columns are marked with the 'category' data type:

df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')

Then I call get_dummies() from pandas:

df = pd.get_dummies(df)

Now I train and test my data set:

from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()

from sklearn.model_selection import train_test_split
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)

train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]

test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]

lr.fit(train_set_x, train_set_y)

When I call the score method I get around 0.867. However, when I call roc_auc_score I get a much lower value, around 0.583:

from sklearn.metrics import roc_auc_score

probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])

Is there any reason why the ROC AUC is much lower than what the score method provides?

asked Nov 03 '17 by Jon


People also ask

Is ROC AUC same as accuracy?

The first big difference is that you calculate accuracy on the predicted classes, while you calculate ROC AUC on predicted scores; that means that for accuracy you will have to find the optimal threshold for your problem. Moreover, accuracy looks at the fractions of correctly assigned positive and negative classes.
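A minimal sketch of that difference (the data here is purely hypothetical): accuracy needs hard class labels, which only exist once a threshold has been applied, while roc_auc_score works directly on the continuous scores.

from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical ground truth and classifier scores (probability of class 1)
y_true = [0, 0, 1, 1, 1, 0]
y_scores = [0.2, 0.6, 0.7, 0.9, 0.4, 0.1]

# Accuracy requires hard labels, i.e. a threshold has already been chosen
y_pred = [1 if p >= 0.5 else 0 for p in y_scores]
print(accuracy_score(y_true, y_pred))

# ROC AUC is computed from the scores themselves, no threshold involved
print(roc_auc_score(y_true, y_scores))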

Is ROC AUC better than accuracy?

In this paper we establish rigourously that, even in this setting, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, provides a better measure than accuracy.

How is AUC related to accuracy?

AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
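That probabilistic reading can be checked directly with a small sketch (again with hypothetical scores): the fraction of (positive, negative) pairs in which the positive example receives the higher score matches roc_auc_score.

from itertools import product
from sklearn.metrics import roc_auc_score

# Hypothetical scores for three positive and three negative examples
pos_scores = [0.9, 0.7, 0.4]
neg_scores = [0.1, 0.2, 0.6]

# Fraction of (positive, negative) pairs ranked correctly (ties count as 0.5)
pairs = list(product(pos_scores, neg_scores))
pairwise = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p, n in pairs) / len(pairs)

y_true = [1] * len(pos_scores) + [0] * len(neg_scores)
y_score = pos_scores + neg_scores
print(pairwise, roc_auc_score(y_true, y_score))  # both ~0.889 for these scores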

Can AUC be higher than accuracy?

I stumbled upon a 3-class classification problem where all compared classifiers yield a higher AUC than accuracy (usually around 10% higher). This happens whether the dataset is balanced or slightly imbalanced.


1 Answer

To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.

[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]

According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) practice is to use it just like any other the-higher-the-better metric, such as accuracy, which naturally leads to puzzles like the one you describe.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.

The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).

Now, this threshold, in methods like scikit-learn's predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
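As a rough illustration (reusing the fitted lr and the test split from the question above), predict is essentially thresholding predict_proba at 0.5, and nothing stops you from picking a different threshold:

# Probability of the positive class (ARR_DEL15 == 1) for the test samples
proba = lr.predict_proba(test_set_x)[:, 1]

# predict() essentially corresponds to (proba >= 0.5)
labels_default = (proba >= 0.5).astype(int)

# A lower, hypothetical threshold flags more flights as delayed
custom_threshold = 0.3
labels_custom = (proba >= custom_threshold).astype(int)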

The point to take home is that:

  • when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
  • when you ask for AUC (which, in contrast, uses the probabilities returned by predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds (see the sketch after this list)
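A rough sketch of those two points, again reusing the fitted lr and the test split from the question (the exact numbers of course depend on your data): score is just the accuracy of predict at the implicit 0.5 threshold, the accuracy you would report moves with the threshold, and AUC is a single threshold-free summary of the probabilities.

from sklearn.metrics import accuracy_score, roc_auc_score

proba = lr.predict_proba(test_set_x)[:, 1]

# score() is the accuracy of predict(), i.e. accuracy at the implicit 0.5 threshold
print(lr.score(test_set_x, test_set_y))
print(accuracy_score(test_set_y, (proba >= 0.5).astype(int)))

# Accuracy changes as the threshold moves...
for t in (0.3, 0.5, 0.7):
    print(t, accuracy_score(test_set_y, (proba >= t).astype(int)))

# ...while ROC AUC is computed from the probabilities alone, with no threshold
print(roc_auc_score(test_set_y, proba))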

Given these clarifications, your particular example provides a very interesting case in point:

I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?

Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical purposes, what you care about is a classifier deployed with a specific threshold; what this classifier does in a purely theoretical and abstract situation, when averaged across all possible thresholds, should be of very little interest to a practitioner (it is of interest to a researcher coming up with a new algorithm, but I assume that this is not your case).

(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
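For that imbalanced case, a minimal sketch of the metrics mentioned above, reusing the fitted lr and the test split from the question:

from sklearn.metrics import confusion_matrix, precision_score, recall_score

preds = lr.predict(test_set_x)

# Rows are the true classes, columns the predicted classes
print(confusion_matrix(test_set_y, preds))

# Precision: how many predicted delays are real delays
print(precision_score(test_set_y, preds))

# Recall: how many real delays the classifier actually catches
print(recall_score(test_set_y, preds))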

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system.

Emphasis mine - see also On the dangers of AUC...

answered Oct 27 '22 by desertnaut