Using the R package caret, how can I generate a ROC curve based on the cross-validation results of the train() function?
Say, I do the following:
data(Sonar)
ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE)
rfFit <- train(Class ~ ., data = Sonar,
               method = "rf",
               preProc = c("center", "scale"),
               trControl = ctrl)
The training function sweeps over a range of values of the mtry parameter and calculates the ROC AUC. I would like to see the associated ROC curve -- how do I do that?
Note: if the resampling method is LOOCV, then rfFit will contain a non-null data frame in the rfFit$pred slot, which seems to be exactly what I need. However, I need that for the "cv" method (k-fold cross-validation) rather than LOO.
Also: no, the roc function that was included in former versions of caret is not an answer -- it is a low-level function, and you can't use it if you don't have the prediction probabilities for each cross-validated sample.
The only thing missing from ctrl is the savePredictions = TRUE argument (this also works for other resampling methods):
library(caret)
library(mlbench)
data(Sonar)

ctrl <- trainControl(method = "cv",
                     summaryFunction = twoClassSummary,
                     classProbs = TRUE,
                     savePredictions = TRUE)
rfFit <- train(Class ~ ., data = Sonar,
               method = "rf",
               preProc = c("center", "scale"),
               trControl = ctrl)

library(pROC)
# Select a parameter setting
selectedIndices <- rfFit$pred$mtry == 2
# Plot:
plot.roc(rfFit$pred$obs[selectedIndices],
         rfFit$pred$M[selectedIndices])
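As a small aside (not part of the original answer): rather than hardcoding mtry == 2, you can filter on whatever tuning value train() actually selected, via its bestTune slot. A minimal sketch:

# Sketch: pick the predictions made with the mtry value train() selected,
# so the curve corresponds to the tuned model rather than a hardcoded value.
selectedIndices <- rfFit$pred$mtry == rfFit$bestTune$mtry
plot.roc(rfFit$pred$obs[selectedIndices],
         rfFit$pred$M[selectedIndices])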
Maybe I am missing something, but a small concern is that train always estimates slightly different AUC values than plot.roc and pROC::auc (absolute difference < 0.005), even though twoClassSummary itself uses pROC::auc to estimate the AUC. Edit: I assume this happens because the ROC value reported by train is the average of the AUCs computed on the separate CV folds, whereas here we compute the AUC over all resamples at once to obtain a single overall AUC.
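To illustrate that point, here is a sketch (not from the original answer; it assumes rfFit$pred contains the usual Resample column that caret adds when savePredictions = TRUE with k-fold CV):

library(pROC)
preds <- rfFit$pred[rfFit$pred$mtry == 2, ]

# AUC computed within each CV fold, then averaged -- this should be close
# to the ROC value that train() reports for mtry = 2.
fold_aucs <- sapply(split(preds, preds$Resample), function(fold)
  as.numeric(auc(fold$obs, fold$M, levels = c("R", "M"), direction = "<")))
mean(fold_aucs)

# AUC pooled over all resamples at once -- what plot.roc() above is based on.
as.numeric(auc(preds$obs, preds$M, levels = c("R", "M"), direction = "<"))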
Update: since this is getting a bit of attention, here's a solution using plotROC::geom_roc() for ggplot2:
library(ggplot2)
library(plotROC)

ggplot(rfFit$pred[selectedIndices, ],
       aes(m = M, d = factor(obs, levels = c("R", "M")))) +
  geom_roc(hjust = -0.4, vjust = 1.5) +
  coord_equal()
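If you also want the AUC from this plot, plotROC provides (as far as I know) the style_roc() and calc_auc() helpers; a minimal sketch under that assumption:

# Sketch: build the same plot, then pull the AUC out of the ggplot object.
p <- ggplot(rfFit$pred[selectedIndices, ],
            aes(m = M, d = factor(obs, levels = c("R", "M")))) +
  geom_roc(n.cuts = 0) +
  coord_equal() +
  style_roc()
calc_auc(p)  # returns a data frame containing the AUC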