
How does caret train determine the probability threshold to maximise Specificity

Tags:

r

r-caret

I am using caret's twoClassSummary function to determine the optimal model hyper-parameters to maximise Specificity. However, how does the function determine the probability threshold that maximises Specificity?

Does caret essentially evaluate every threshold between 0 and 1 for each model hyper-parameter/fold and return the maximum Specificity? In the example below you can see the model has landed on cp = 0.01492537.

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)
# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=twoClassSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength=5,
             metric="Spec", 
             trControl=control)

print(fit)


CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 614, 614, 615, 615, 614 
Resampling results across tuning parameters:

  cp          ROC        Sens   Spec     
  0.01305970  0.7615943  0.824  0.5937806
  0.01492537  0.7712055  0.824  0.6016073
  0.01741294  0.7544469  0.830  0.5976939
  0.10447761  0.6915783  0.866  0.5035639
  0.24253731  0.6437820  0.884  0.4035639

Spec was used to select the optimal model using  the largest value.
The final value used for the model was cp = 0.01492537.
asked Oct 19 '25 by pmanDS

1 Answer

No, twoClassSummary does not evaluate every threshold between 0 and 1. It only returns the values for the standard threshold of 0.5; more precisely, it scores the class labels the model predicted, which for most models correspond to a 0.5 probability cutoff.

The twoClassSummary function is defined as:

 function (data, lev = NULL, model = NULL) 
{
    lvls <- levels(data$obs)
    if (length(lvls) > 2) 
        stop(paste("Your outcome has", length(lvls), "levels. The twoClassSummary() function isn't appropriate."))
    requireNamespaceQuietStop("ModelMetrics")
    if (!all(levels(data[, "pred"]) == lvls)) 
        stop("levels of observed and predicted data do not match")
    rocAUC <- ModelMetrics::auc(ifelse(data$obs == lev[2], 0, 
        1), data[, lvls[1]])
    out <- c(rocAUC, sensitivity(data[, "pred"], data[, "obs"], 
        lev[1]), specificity(data[, "pred"], data[, "obs"], lev[2]))
    names(out) <- c("ROC", "Sens", "Spec")
    out
}
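
To see what that function receives, here is a minimal sketch with a hypothetical hand-built toy data frame: during resampling caret passes a data frame containing a factor column obs (the observed classes), a factor column pred (the predicted class labels) and one numeric probability column per class level. Sens and Spec are computed from pred alone, so the probability columns only influence ROC:

# minimal sketch: a hypothetical toy data frame shaped like the one
# caret hands to the summary function during resampling
library(caret)

toy <- data.frame(
  obs = factor(rep(c("neg", "pos"), each = 5)),                # observed classes
  neg = c(0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.55, 0.1, 0.45)  # P(class == "neg")
)
toy$pos  <- 1 - toy$neg                                        # P(class == "pos")
toy$pred <- factor(ifelse(toy$neg > 0.5, "neg", "pos"),        # labels at the 0.5 cutoff
                   levels = levels(toy$obs))

# returns the named vector c(ROC, Sens, Spec); needs the ModelMetrics
# package installed, which twoClassSummary uses internally for the AUC
twoClassSummary(toy, lev = levels(toy$obs))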

To verify my statement, try the following example with a custom summaryFunction in which I explicitly set the threshold to 0.5. You will see that the two values Spec (the original specificity reported by twoClassSummary) and Spec2 (specificity with the threshold manually set at 0.5) are exactly the same:

# load libraries
library(caret)
library(mlbench)
# load the dataset
data(PimaIndiansDiabetes)

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # Spec: specificity from the class labels caret predicted,
  # exactly as twoClassSummary computes it
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  # Spec2: specificity from labels re-derived by explicitly
  # thresholding the "neg" class probability at 0.5
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"),
                 levels = levels(data[, "obs"]))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  out <- c(spec, spec2)

  names(out) <- c("Spec", "Spec2")
  out
}

# prepare resampling method
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit <- train(diabetes~., 
             data=PimaIndiansDiabetes, 
             method="rpart", 
             tuneLength=5,
             metric="Spec", 
             trControl=control)

print(fit)
CART 

768 samples
  8 predictor
  2 classes: 'neg', 'pos' 

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 615, 615, 614, 614, 614 
Resampling results across tuning parameters:

  cp          Spec       Spec2    
  0.01305970  0.5749825  0.5749825
  0.01492537  0.5411600  0.5411600
  0.01741294  0.5596785  0.5596785
  0.10447761  0.4932215  0.4932215
  0.24253731  0.2837177  0.2837177

Spec was used to select the optimal model using the largest value.
The final value used for the model was cp = 0.0130597.

In addition, if you would like caret to calculate, for each hyper-parameter setting, the maximum specificity over a whole range of thresholds and report that value, you can define a custom summaryFunction like the following, which tries all thresholds from 0.1 to 0.95 in steps of 0.05:

# define custom summaryFunction
customSummary <- function (data, lev = NULL, model = NULL){
  # Spec: specificity at the default 0.5 threshold, as before
  spec <- specificity(data[, "pred"], data[, "obs"], lev[2])
  pred <- factor(ifelse(data[, "neg"] > 0.5, "neg", "pos"),
                 levels = levels(data[, "obs"]))
  spec2 <- specificity(pred, data[, "obs"], "pos")
  # Specmax: recompute the specificity for every threshold from
  # 0.1 to 0.95 in steps of 0.05 and keep the largest value
  speclist <- numeric(0)
  for(i in seq(0.1, 0.95, 0.05)){
    predi <- factor(ifelse(data[, "neg"] > i, "neg", "pos"),
                    levels = levels(data[, "obs"]))
    singlespec <- specificity(predi, data[, "obs"], "pos")
    speclist <- c(speclist, singlespec)
  }
  specmax <- max(speclist)

  out <- c(spec, spec2, specmax)

  names(out) <- c("Spec", "Spec2", "Specmax")
  out
}
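
You can then point metric at "Specmax" so that train picks the cp value with the largest threshold-maximised specificity. A minimal usage sketch, reusing the objects from above (fit_max is just a hypothetical name):

# prepare resampling method with the extended summary
control <- trainControl(method="cv", 
                        number=5, 
                        classProbs=TRUE,
                        summaryFunction=customSummary)

set.seed(7)
fit_max <- train(diabetes~., 
                 data=PimaIndiansDiabetes, 
                 method="rpart", 
                 tuneLength=5,
                 metric="Specmax",  # select cp by the best specificity over all tried thresholds
                 trControl=control)

print(fit_max)

Note that Specmax only reports the best achievable specificity per resample, not which threshold achieved it; if you also need the threshold, you can return something like seq(0.1, 0.95, 0.05)[which.max(speclist)] from the summary function as an extra metric.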
answered Oct 22 '25 by carpa_jo


