Using a LogisticRegression
class in scikit-learn
on a version of the flight delay dataset.
I use pandas
to select some columns:
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
I fill in NaN
values with 0:
df = df.fillna({'ARR_DEL15': 0})
Make sure the categorical columns are marked with the 'category' data type:
df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')
Then call get_dummies()
from pandas
:
df = pd.get_dummies(df)
Now I train and test my data set:
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
test_set, train_set = train_test_split(df, test_size=0.2, random_state=42)
train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]
test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]
lr.fit(train_set_x, train_set_y)
Once I call the score
method I get around 0.867. However, when I call the roc_auc_score
method I get a much lower number of around 0.583
probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])
Is there any reason why the ROC AUC is much lower than what the score
method provides?
Accuracy vs ROC AUC The first big difference is that you calculate accuracy on the predicted classes while you calculate ROC AUC on predicted scores. That means you will have to find the optimal threshold for your problem. Moreover, accuracy looks at fractions of correctly assigned positive and negative classes.
In this paper we establish rigourously that, even in this setting, the area under the ROC (Receiver Operating Characteristics) curve, or simply AUC, provides a better measure than accuracy.
AUC represents the probability that a random positive (green) example is positioned to the right of a random negative (red) example. AUC ranges in value from 0 to 1. A model whose predictions are 100% wrong has an AUC of 0.0; one whose predictions are 100% correct has an AUC of 1.0.
I stumbled upon a 3-class classification problem where all compared classifiers yield a higher AUC than accuracy (usually around 10% higher). This happens both when the dataset is balanced or slightly imbalanced.
To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.
[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]
According to my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: the common (and unfortunate) use is just like any other the-higher-the-better metric, like accuracy, which may naturally lead to puzzles like the one you express yourself.
The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p
in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba
returns).
Now, this threshold, in methods like scikit-learn predict
which return labels (1/0
), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in come cases (imbalanced data, for example).
The point to take home is that:
score
(which under the hood uses predict
, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5predict_proba
), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholdsGiven these clarifications, your particular example provides a very interesting case in point:
I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?
Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care; for all practical cases, what you care for is a classifier deployed with a specific threshold, and what this classifier does in a purely theoretical and abstract situation when averaged across all possible thresholds should pose very little interest for a practitioner (it does pose interest for a researcher coming up with a new algorithm, but I assume that this is not your case).
(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead).
For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:
Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.
[...]
One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system
Emphasis mine - see also On the dangers of AUC...
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With