Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers that are in [0,1]. However, it sometimes gives me an array with the first number close to "2". Is it a bug or I did sth wrong? Thanks.
In [1]: import numpy as np
In [2]: from sklearn.metrics import roc_curve
In [3]: np.random.seed(11)
In [4]: aa = np.random.choice([True, False],100)
In [5]: bb = np.random.uniform(0,1,100)
In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)
In [7]: thresholds
Out[7]:
array([ 1.97396826, 0.97396826, 0.9711752 , 0.95996265, 0.95744405,
0.94983331, 0.93290463, 0.93241372, 0.93214862, 0.93076592,
0.92960511, 0.92245024, 0.91179548, 0.91112166, 0.87529458,
0.84493853, 0.84068543, 0.83303741, 0.82565223, 0.81096657,
0.80656679, 0.79387241, 0.77054807, 0.76763223, 0.7644911 ,
0.75964947, 0.73995152, 0.73825262, 0.73466772, 0.73421299,
0.73282534, 0.72391126, 0.71296292, 0.70930102, 0.70116428,
0.69606617, 0.65869235, 0.65670881, 0.65261474, 0.6487222 ,
0.64805644, 0.64221486, 0.62699782, 0.62522484, 0.62283401,
0.61601839, 0.611632 , 0.59548669, 0.57555854, 0.56828967,
0.55652111, 0.55063947, 0.53885029, 0.53369398, 0.52157349,
0.51900774, 0.50547317, 0.49749635, 0.493913 , 0.46154029,
0.45275916, 0.44777116, 0.43822067, 0.43795921, 0.43624093,
0.42039077, 0.41866343, 0.41550367, 0.40032843, 0.36761763,
0.36642721, 0.36567017, 0.36148354, 0.35843793, 0.34371331,
0.33436415, 0.33408289, 0.33387442, 0.31887024, 0.31818719,
0.31367915, 0.30216469, 0.30097917, 0.29995201, 0.28604467,
0.26930354, 0.2383461 , 0.22803687, 0.21800338, 0.19301808,
0.16902881, 0.1688173 , 0.14491946, 0.13648451, 0.12704826,
0.09141459, 0.08569481, 0.07500199, 0.06288762, 0.02073298,
0.01934336])
thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1. Source: sklearn.metrics.roc_curve. Because it starts with the largest value in data + 1.
roc_curve() constructs the full ROC curve and returns a tibble. See roc_auc() for the area under the ROC curve.
ROC Curves and AUC in Python The AUC for the ROC can be calculated using the roc_auc_score() function. Like the roc_curve() function, the AUC function takes both the true outcomes (0,1) from the test set and the predicted probabilities for the 1 class.
Most of the time these thresholds are not used, for example in calculating the area under the curve, or plotting the False Positive Rate against the True Positive Rate.
Yet to plot what looks like a reasonable curve, one needs to have a threshold that incorporates 0 data points. Since Scikit-Learn's ROC curve function need not have normalised probabilities for thresholds (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf
is sensible but coders often expect finite data (and it's possible the implementation also works for integer thresholds). Instead the implementation uses max(score) + epsilon
where epsilon = 1
. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
From the documentation:
thresholds : array, shape = [n_thresholds] Decreasing thresholds on the decision function used to compute fpr and tpr.
thresholds[0]
represents no instances being predicted and is arbitrarily set tomax(y_score) + 1
.
So the first element of thresholds
is close to 2 because it is max(y_score) + 1
, in your case thresholds[1] + 1
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With