Getting a low ROC AUC score but a high accuracy

Tags: machine-learning, classification, scikit-learn, logistic-regression, auc

I am using the LogisticRegression class in scikit-learn on a version of the flight delay dataset.

I use pandas to select some columns:

df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]

I fill in NaN values with 0:

df = df.fillna({'ARR_DEL15': 0})

Make sure the categorical columns are marked with the 'category' data type:

df["ORIGIN"] = df["ORIGIN"].astype('category')
df["DEST"] = df["DEST"].astype('category')

Then call get_dummies() from pandas:

df = pd.get_dummies(df)

Now I train and test my data set:

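from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

lr = LogisticRegression()

# train_test_split returns (train, test), in that order
train_set, test_set = train_test_split(df, test_size=0.2, random_state=42)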

train_set_x = train_set.drop('ARR_DEL15', axis=1)
train_set_y = train_set["ARR_DEL15"]

test_set_x = test_set.drop('ARR_DEL15', axis=1)
test_set_y = test_set["ARR_DEL15"]

lr.fit(train_set_x, train_set_y)

Once I call the score method I get around 0.867. However, when I call the roc_auc_score method I get a much lower number of around 0.583:

from sklearn.metrics import roc_auc_score

probabilities = lr.predict_proba(test_set_x)
roc_auc_score(test_set_y, probabilities[:, 1])

Is there any reason why the ROC AUC is much lower than what the score method provides?

asked Nov 03 '17 20:11

Jon

1 Answer

To start with, saying that an AUC of 0.583 is "lower" than a score* of 0.867 is exactly like comparing apples with oranges.

[* I assume your score is mean accuracy, but this is not critical for this discussion - it could be anything else in principle]

In my experience at least, most ML practitioners think that the AUC score measures something different from what it actually does: it is commonly (and unfortunately) used just like any other the-higher-the-better metric, such as accuracy, which naturally leads to puzzles like the one you describe.

The truth is that, roughly speaking, the AUC measures the performance of a binary classifier averaged across all possible decision thresholds.
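To make "across all possible decision thresholds" concrete, here is a minimal sketch. It assumes synthetic data from `make_classification` as a stand-in (not the flight delay dataset), and the variable names are illustrative only. It sweeps the thresholds with `roc_curve` and checks that the area under the resulting points matches what `roc_auc_score` reports.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the flight delay dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = clf.predict_proba(X_test)[:, 1]  # probability of class 1

# One (false positive rate, true positive rate) point per candidate threshold
fpr, tpr, thresholds = roc_curve(y_test, p_test)

# Area under those points, computed with the trapezoidal rule
area = auc(fpr, tpr)
print(area, roc_auc_score(y_test, p_test))
```

No single threshold is picked anywhere in this computation; every point on the curve corresponds to a different one.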

The (decision) threshold in binary classification is the value above which we decide to label a sample as 1 (recall that probabilistic classifiers actually return a value p in [0, 1], usually interpreted as a probability - in scikit-learn it is what predict_proba returns).

Now, this threshold, in methods like scikit-learn predict which return labels (1/0), is set to 0.5 by default, but this is not the only possibility, and it may not even be desirable in some cases (imbalanced data, for example).
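As a rough illustration of the threshold, the sketch below (again on synthetic data, which is an assumption) reproduces the labels from `predict` by thresholding `predict_proba` at 0.5, and then applies a different, arbitrarily chosen threshold of 0.3 by hand.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption; not the flight delay dataset)
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X, y)

p = clf.predict_proba(X)[:, 1]  # probability of class 1

# predict() corresponds to a 0.5 threshold on that probability
# (barring samples that sit exactly on the decision boundary)
labels_default = clf.predict(X)
labels_at_half = (p > 0.5).astype(int)
print(np.array_equal(labels_default, labels_at_half))

# A lower threshold (0.3 is an arbitrary choice) labels more samples as 1
labels_at_03 = (p >= 0.3).astype(int)
print(labels_default.sum(), labels_at_03.sum())
```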

The point to take home is that:

- when you ask for score (which under the hood uses predict, i.e. labels and not probabilities), you have also implicitly set this threshold to 0.5
- when you ask for AUC (which, in contrast, uses probabilities returned with predict_proba), no threshold is involved, and you get (something like) the accuracy averaged across all possible thresholds
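The two points above can be checked with a small sketch, assuming synthetic data rather than the dataset from the question. `score` coincides with the accuracy of the labels from `predict`, while the AUC depends only on how the probabilities rank the samples: halving every probability leaves the AUC untouched but pushes everything below the 0.5 threshold.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption; not the flight delay dataset)
X, y = make_classification(n_samples=2000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
p_test = clf.predict_proba(X_test)[:, 1]

# score() is the accuracy of the labels returned by predict()
score_value = clf.score(X_test, y_test)
acc_from_labels = accuracy_score(y_test, clf.predict(X_test))
print(score_value, acc_from_labels)

# Halving the probabilities keeps their ordering, so the AUC is unchanged ...
p_halved = p_test / 2
auc_original = roc_auc_score(y_test, p_test)
auc_halved = roc_auc_score(y_test, p_halved)
print(auc_original, auc_halved)

# ... but at a 0.5 threshold every sample is now labelled 0
acc_halved = accuracy_score(y_test, (p_halved > 0.5).astype(int))
print(acc_halved)
```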

Given these clarifications, your particular example provides a very interesting case in point:

I get a good-enough accuracy ~ 87% with my model; should I care that, according to an AUC of 0.58, my classifier does only slightly better than mere random guessing?

Provided that the class representation in your data is reasonably balanced, the answer by now should hopefully be obvious: no, you should not care. For all practical purposes, what you care about is a classifier deployed with a specific threshold; what this classifier does in a purely theoretical and abstract situation, averaged across all possible thresholds, should be of very little interest to a practitioner (it is of interest to a researcher coming up with a new algorithm, but I assume that this is not your case).

(For imbalanced data, the argument changes; accuracy here is practically useless, and you should consider precision, recall, and the confusion matrix instead.)
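To see why accuracy alone can mislead on imbalanced data, here is a small sketch with made-up labels. The 87/13 split is a hypothetical chosen only to mirror the reported accuracy; the actual class balance of the data in the question is not known. The "classifier" here ignores its input and always predicts the majority class.

```python
import numpy as np
from sklearn.metrics import (
    accuracy_score,
    confusion_matrix,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Hypothetical labels: 87 negatives and 13 positives (an assumption)
y_true = np.array([0] * 87 + [1] * 13)

# Always predict the majority class, with the same probability for everyone
y_pred = np.zeros_like(y_true)
p_const = np.full(len(y_true), 0.1)

print(accuracy_score(y_true, y_pred))    # 0.87
print(roc_auc_score(y_true, p_const))    # 0.5, i.e. no ranking ability
print(confusion_matrix(y_true, y_pred))  # rows: true class, columns: predicted
print(recall_score(y_true, y_pred))      # 0.0, no positive is ever found
print(precision_score(y_true, y_pred, zero_division=0))  # 0.0 by convention
```

The confusion matrix and the recall expose at once what the 0.87 accuracy hides: not a single positive sample is detected.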

For this reason, AUC has started receiving serious criticism in the literature (don't misread this - the analysis of the ROC curve itself is highly informative and useful); the Wikipedia entry and the references provided therein are highly recommended reading:

Thus, the practical value of the AUC measure has been called into question, raising the possibility that the AUC may actually introduce more uncertainty into machine learning classification accuracy comparisons than resolution.

[...]

One recent explanation of the problem with ROC AUC is that reducing the ROC Curve to a single number ignores the fact that it is about the tradeoffs between the different systems or performance points plotted and not the performance of an individual system

Emphasis mine - see also On the dangers of AUC...

answered Oct 27 '22 20:10

desertnaut
