Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

scikit-learn roc_curve: why does it return a threshold value = 2 some time?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers that are in [0,1]. However, it sometimes gives me an array with the first number close to "2". Is it a bug or I did sth wrong? Thanks.

In [1]: import numpy as np

In [2]: from sklearn.metrics import roc_curve

In [3]: np.random.seed(11)

In [4]: aa = np.random.choice([True, False],100)

In [5]: bb = np.random.uniform(0,1,100)

In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)

In [7]: thresholds
Out[7]: 
array([ 1.97396826,  0.97396826,  0.9711752 ,  0.95996265,  0.95744405,
    0.94983331,  0.93290463,  0.93241372,  0.93214862,  0.93076592,
    0.92960511,  0.92245024,  0.91179548,  0.91112166,  0.87529458,
    0.84493853,  0.84068543,  0.83303741,  0.82565223,  0.81096657,
    0.80656679,  0.79387241,  0.77054807,  0.76763223,  0.7644911 ,
    0.75964947,  0.73995152,  0.73825262,  0.73466772,  0.73421299,
    0.73282534,  0.72391126,  0.71296292,  0.70930102,  0.70116428,
    0.69606617,  0.65869235,  0.65670881,  0.65261474,  0.6487222 ,
    0.64805644,  0.64221486,  0.62699782,  0.62522484,  0.62283401,
    0.61601839,  0.611632  ,  0.59548669,  0.57555854,  0.56828967,
    0.55652111,  0.55063947,  0.53885029,  0.53369398,  0.52157349,
    0.51900774,  0.50547317,  0.49749635,  0.493913  ,  0.46154029,
    0.45275916,  0.44777116,  0.43822067,  0.43795921,  0.43624093,
    0.42039077,  0.41866343,  0.41550367,  0.40032843,  0.36761763,
    0.36642721,  0.36567017,  0.36148354,  0.35843793,  0.34371331,
    0.33436415,  0.33408289,  0.33387442,  0.31887024,  0.31818719,
    0.31367915,  0.30216469,  0.30097917,  0.29995201,  0.28604467,
    0.26930354,  0.2383461 ,  0.22803687,  0.21800338,  0.19301808,
    0.16902881,  0.1688173 ,  0.14491946,  0.13648451,  0.12704826,
    0.09141459,  0.08569481,  0.07500199,  0.06288762,  0.02073298,
    0.01934336])
like image 441
BlueFeet Avatar asked Apr 21 '14 15:04

BlueFeet


People also ask

What is threshold in roc_curve Sklearn?

thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1. Source: sklearn.metrics.roc_curve. Because it starts with the largest value in data + 1.

What does roc_curve return?

roc_curve() constructs the full ROC curve and returns a tibble. See roc_auc() for the area under the ROC curve.

How is ROC AUC score calculated in Python?

ROC Curves and AUC in Python The AUC for the ROC can be calculated using the roc_auc_score() function. Like the roc_curve() function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class.


2 Answers

Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.

Yet to plot what looks like a reasonable curve, one needs to have a threshold that incorporates 0 data points. Since Scikit-Learn's ROC curve function need not have normalised probabilities for thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf is sensible but coders often expect finite data (and it's possible the implementation also works for integer thresholds). Instead the implementation uses max(score) + epsilon where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!

like image 188
joeln Avatar answered Sep 29 '22 12:09

joeln


From the documentation:

thresholds : array, shape = [n_thresholds] Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

So the first element of thresholds is close to 2 because it is max(y_score) + 1, in your case thresholds[1] + 1.

like image 30
afrendeiro Avatar answered Sep 29 '22 12:09

afrendeiro