scikit-learn roc_curve: why does it sometimes return a threshold value of 2?

Correct me if I'm wrong: the "thresholds" returned by scikit-learn's roc_curve should be an array of numbers in [0,1]. However, it sometimes gives me an array whose first number is close to 2. Is this a bug, or did I do something wrong? Thanks.

In [1]: import numpy as np

In [2]: from sklearn.metrics import roc_curve

In [3]: np.random.seed(11)

In [4]: aa = np.random.choice([True, False],100)

In [5]: bb = np.random.uniform(0,1,100)

In [6]: fpr,tpr,thresholds = roc_curve(aa,bb)

In [7]: thresholds
Out[7]: 
array([ 1.97396826,  0.97396826,  0.9711752 ,  0.95996265,  0.95744405,
    0.94983331,  0.93290463,  0.93241372,  0.93214862,  0.93076592,
    0.92960511,  0.92245024,  0.91179548,  0.91112166,  0.87529458,
    0.84493853,  0.84068543,  0.83303741,  0.82565223,  0.81096657,
    0.80656679,  0.79387241,  0.77054807,  0.76763223,  0.7644911 ,
    0.75964947,  0.73995152,  0.73825262,  0.73466772,  0.73421299,
    0.73282534,  0.72391126,  0.71296292,  0.70930102,  0.70116428,
    0.69606617,  0.65869235,  0.65670881,  0.65261474,  0.6487222 ,
    0.64805644,  0.64221486,  0.62699782,  0.62522484,  0.62283401,
    0.61601839,  0.611632  ,  0.59548669,  0.57555854,  0.56828967,
    0.55652111,  0.55063947,  0.53885029,  0.53369398,  0.52157349,
    0.51900774,  0.50547317,  0.49749635,  0.493913  ,  0.46154029,
    0.45275916,  0.44777116,  0.43822067,  0.43795921,  0.43624093,
    0.42039077,  0.41866343,  0.41550367,  0.40032843,  0.36761763,
    0.36642721,  0.36567017,  0.36148354,  0.35843793,  0.34371331,
    0.33436415,  0.33408289,  0.33387442,  0.31887024,  0.31818719,
    0.31367915,  0.30216469,  0.30097917,  0.29995201,  0.28604467,
    0.26930354,  0.2383461 ,  0.22803687,  0.21800338,  0.19301808,
    0.16902881,  0.1688173 ,  0.14491946,  0.13648451,  0.12704826,
    0.09141459,  0.08569481,  0.07500199,  0.06288762,  0.02073298,
    0.01934336])

asked Apr 21 '14 15:04

BlueFeet

2 Answers

Most of the time these thresholds are not used, for example when calculating the area under the curve or plotting the True Positive Rate against the False Positive Rate.

Yet to plot a reasonable-looking curve, one needs a threshold at which no data points are predicted positive. Since scikit-learn's roc_curve function does not require normalised probabilities as scores (any score is fine), setting this point's threshold to 1 isn't sufficient; setting it to inf is sensible, but coders often expect finite data (and it's possible the implementation also works with integer scores). Instead the implementation uses max(score) + epsilon, where epsilon = 1. This may be cosmetically deficient, but you haven't given any reason why it's a problem!
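To make the role of that extra threshold concrete, here is a minimal pure-Python sketch. It is not scikit-learn's actual implementation: the function name roc_points and the toy data are made up for illustration, and it keeps one point per distinct score without dropping any intermediate points. It only shows why a threshold above every score is needed to get the (0, 0) corner of the curve.

```python
def roc_points(y_true, y_score):
    """Simplified ROC sketch (hypothetical helper, not sklearn's code)."""
    n_pos = sum(1 for y in y_true if y)
    n_neg = len(y_true) - n_pos
    # Distinct scores, highest first; a sample counts as predicted
    # positive when its score >= threshold.
    distinct = sorted(set(y_score), reverse=True)
    # Sentinel threshold above every score: nothing is predicted positive.
    thresholds = [distinct[0] + 1] + distinct
    fpr, tpr = [], []
    for t in thresholds:
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= t and y)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= t and not y)
        tpr.append(tp / n_pos)
        fpr.append(fp / n_neg)
    return fpr, tpr, thresholds


# Toy data, assumed for illustration only.
y_true = [True, False, True, False]
y_score = [0.9, 0.8, 0.35, 0.1]
fpr, tpr, thresholds = roc_points(y_true, y_score)
print(list(zip(fpr, tpr)))
```

Without the sentinel, the first point in this toy example would be (0.0, 0.5), so the plotted curve would not start at the origin.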

answered Sep 29 '22 12:09

joeln

From the documentation:

thresholds : array, shape = [n_thresholds] Decreasing thresholds on the decision function used to compute fpr and tpr. thresholds[0] represents no instances being predicted and is arbitrarily set to max(y_score) + 1.

So the first element of thresholds is close to 2 because it is max(y_score) + 1, which in your case equals thresholds[1] + 1.
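If you do use the thresholds, for instance to pick an operating point, one option is simply to skip index 0, since it is not a score your model produced. The sketch below is an assumption-laden illustration: best_threshold is a hypothetical helper, the selection rule (maximising tpr - fpr) is just one possible choice, and the input lists are toy values rather than output from the question.

```python
def best_threshold(fpr, tpr, thresholds):
    """Pick the threshold maximising tpr - fpr, ignoring the sentinel.

    Hypothetical helper for illustration; index 0 is skipped because it
    is the artificial threshold, not a score the model produced.
    """
    candidates = range(1, len(thresholds))
    best = max(candidates, key=lambda i: tpr[i] - fpr[i])
    return thresholds[best]


# Toy ROC arrays in the same layout roc_curve returns (assumed values).
fpr = [0.0, 0.0, 0.5, 0.5, 1.0]
tpr = [0.0, 0.5, 0.5, 1.0, 1.0]
thresholds = [1.9, 0.9, 0.8, 0.35, 0.1]
chosen = best_threshold(fpr, tpr, thresholds)
print(chosen)
```

Because the first entry is excluded, the returned value is always one of the actual scores.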

answered Sep 29 '22 12:09

afrendeiro
