I have some models, using ROCR
package on a vector of the predicted class percentages, I have a performance object. Plotting the performance object with the specifications "tpr", "fpr" gives me a ROC curve.
I'm comparing models at certain thresholds of false positive rate (x). I'm hoping to get the value of the true positive rate (y) out of the performance object. Even more, I would like to get the class percentage threshold that was used to generate that point.
the index number of the false positive rate (x-value
) that is closest to the threshold without being above it, should give me the index number of the appropriate true positive rate (y-value
). I'm not exactly sure how to get that index value.
And more to the point, how do i get the threshold of class probabilities that was used to make that point?
A really easy way to pick a threshold is to take the median predicted values of the positive cases for a test set. This becomes your threshold. The threshold comes relatively close to the same threshold you would get by using the roc curve where true positive rate(tpr) and 1 - false positive rate(fpr) overlap.
In this case: The fair response threshold is F = 1450 + ((3000-1450) * 150 / 100) = 3775 ms. The bad response threshold is B = 1450 + ((3000-1450) * 300 / 100) = 6100 ms.
The ROC curve is produced by calculating and plotting the true positive rate against the false positive rate for a single classifier at a variety of thresholds. For example, in logistic regression, the threshold would be the predicted probability of an observation belonging to the positive class.
The ROC curve always ends at (1, 1) which corresponds to a threshold of 0. So, the threshold decreases as we move from (0, 0) to (1, 1). Let's take an approximate point (0.6, 0.98) on the curve.
This is why str
is my favorite R function:
library(ROCR)
data(ROCR.simple)
pred <- prediction( ROCR.simple$predictions, ROCR.simple$labels)
perf <- performance(pred,"tpr","fpr")
plot(perf)
> str(perf)
Formal class 'performance' [package "ROCR"] with 6 slots
..@ x.name : chr "False positive rate"
..@ y.name : chr "True positive rate"
..@ alpha.name : chr "Cutoff"
..@ x.values :List of 1
.. ..$ : num [1:201] 0 0 0 0 0.00935 ...
..@ y.values :List of 1
.. ..$ : num [1:201] 0 0.0108 0.0215 0.0323 0.0323 ...
..@ alpha.values:List of 1
.. ..$ : num [1:201] Inf 0.991 0.985 0.985 0.983 ...
Ahah! It's an S4 class, so we can use @
to access the slots. Here's how you make a data.frame
:
cutoffs <- data.frame([email protected][[1]], [email protected][[1]],
[email protected][[1]])
> head(cutoffs)
cut fpr tpr
1 Inf 0.000000000 0.00000000
2 0.9910964 0.000000000 0.01075269
3 0.9846673 0.000000000 0.02150538
4 0.9845992 0.000000000 0.03225806
5 0.9834944 0.009345794 0.03225806
6 0.9706413 0.009345794 0.04301075
If you have an fpr threshold you want to hit, you can subset this data.frame
to find maximum tpr below this fpr threshold:
cutoffs <- cutoffs[order(cutoffs$tpr, decreasing=TRUE),]
> head(subset(cutoffs, fpr < 0.2))
cut fpr tpr
96 0.5014893 0.1495327 0.8494624
97 0.4997881 0.1588785 0.8494624
98 0.4965132 0.1682243 0.8494624
99 0.4925969 0.1775701 0.8494624
100 0.4917356 0.1869159 0.8494624
101 0.4901199 0.1962617 0.8494624
Package pROC
includes function coords
for calculating best threshold:
library(pROC)
my_roc <- roc(my_response, my_predictor)
coords(my_roc, "best", ret = "threshold")
2 solutions based on the ROCR
and pROC
packages:
threshold1 <- function(predict, response) {
perf <- ROCR::performance(ROCR::prediction(predict, response), "sens", "spec")
df <- data.frame(cut = [email protected][[1]], sens = [email protected][[1]], spec = [email protected][[1]])
df[which.max(df$sens + df$spec), "cut"]
}
threshold2 <- function(predict, response) {
r <- pROC::roc(response, predict)
r$thresholds[which.max(r$sensitivities + r$specificities)]
}
data(ROCR.simple, package = "ROCR")
threshold1(ROCR.simple$predictions, ROCR.simple$labels)
#> [1] 0.5014893
threshold2(ROCR.simple$predictions, ROCR.simple$labels)
#> [1] 0.5006387
See also OptimalCutpoints
package which provides many algorithms to find an optimal thresholds.
Similar to @Artem's solution Basically the optimal threshold in a ROC curve is the widest part of the curve, or the point which gives maximum TPR while maintaining the lowest FPR FPR & TPR corresponding to best threshold - ROC curve
So one could also find the best threshold by finding the widest point or the point with maximum separation between TPR and FPR
Below is a quick solution using package ROSE
library(ROSE)
library(data.table)
threshold_data<-roc.curve(df$response,my_predictor,plotit = TRUE)
#Get TPR, FPR and corresponding threshold from roc.curve function and convert to dataframe
threshold_data<-data.frame(TPR = threshold_data$false.positive.rate,
FPR = threshold_data$true.positive.rate,
threshold = threshold_data$thresholds)
# TPR FPR threshold sep
# 1.0000000000 1.0000000 -Inf 0.0000000
# 0.7474009553 0.9820701 0.03405027 0.2346691
# 0.5869626300 0.9478403 0.08923265 0.3608776
# 0.4003933689 0.8777506 0.17368989 0.4773572
# 0.2225344198 0.7571312 0.25101859 0.5345968
# 0.1441416128 0.6495518 0.33035935 0.5054101
# 0.0868221411 0.5281174 0.44915920 0.4412952
# 0.0261309357 0.3390383 0.57857430 0.3129074
# 0.0089912897 0.2257539 0.76554635 0.2167626
# 0.0008429334 0.1140994 0.93730006 0.1132565
# 0.0000000000 0.0000000 Inf 0.0000000
threshold_data<-setDT(threshold_data)
threshold_data[,sep:=abs(FPR-TPR)]
best_threshold<-threshold_data[sep==max(sep),threshold]
#0.2510185
#Same result with package pROC
library(pROC)
my_curve <- roc(df$my_response,my_predictor)
coords(my_curve, "best", ret = "threshold")
#0.2510185
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With