I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html, and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl:

method
index

and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).
To better frame my questions, let me use the following example from the documentation:
library(caret)
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)
trControl <- trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
My questions are:

1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
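One hedged way to probe question 1 (a sketch, assuming the caret package is installed): when index is supplied, the resamples are taken from index, so method no longer decides which rows are held out; it can still affect how results are computed and summarised (for example, boot632 blends in the apparent error rate).

```r
library(caret)
data(BloodBrain)
set.seed(1)

# 100 stratified 80/20 splits; trainControl will use exactly these
tmp <- createDataPartition(logBBB, p = .8, times = 100)

# Two controls sharing the same index but differing in method
ctrl1 <- trainControl(method = "LGOCV", index = tmp)
ctrl2 <- trainControl(method = "cv",    index = tmp)

# Both carry the same 100 hold-out splits, regardless of method
length(ctrl1$index)  # 100
length(ctrl2$index)  # 100
```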
2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?
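A hedged sketch of the difference (assuming the caret package): createDataPartition draws a stratified split without replacement (each resample holds a fraction p of the rows, no duplicates), while createResample draws a bootstrap sample with replacement (same size as the data, duplicates expected).

```r
library(caret)
set.seed(1)
y <- rnorm(100)

part <- createDataPartition(y, p = .8, times = 1)[[1]]
boot <- createResample(y, times = 1)[[1]]

length(part)           # roughly 80: an 80% split
any(duplicated(part))  # FALSE: sampled without replacement
length(boot)           # 100: same size as the data
any(duplicated(boot))  # TRUE (almost surely): sampled with replacement
```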
3) How can I do stratified k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k = 10, list = TRUE, returnTrain = TRUE)
trControl <- trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
(Note: createFolds has no times argument, and index expects the training rows of each resample, hence returnTrain = TRUE.)
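If repeated stratified folds are wanted (e.g. 10-fold repeated 100 times), createMultiFolds is the matching helper; a sketch, assuming the caret package (createMultiFolds stratifies on the outcome and already returns training-row indices):

```r
library(caret)
data(BloodBrain)
set.seed(1)

# 10 stratified folds, repeated 100 times; elements are training-row indices
tmp <- createMultiFolds(logBBB, k = 10, times = 100)
trControl <- trainControl(method = "repeatedcv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
```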
You can use the trainControl function to specify a number of parameters (including sampling parameters) for your model. The object output by trainControl is then passed as an argument to train.
Data splitting divides the data into two or more subsets. Typically, with a two-part split, one part is used to train the model and the other to test (evaluate) it. Data splitting is an important aspect of data science, particularly for building models from data.
tuneLength lets the system tune the algorithm automatically: it specifies how many different values to try for each tuning parameter, for example mtry for randomForest. With tuneLength = 5, train tries 5 different mtry values and finds the optimal mtry based on those 5.
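A brief sketch of tuneLength in use (hedged; assumes the caret and randomForest packages are installed, and 5-fold CV is chosen here only to keep it quick):

```r
library(caret)
data(BloodBrain)
set.seed(1)

# Try 5 candidate mtry values; train picks the best by resampled RMSE
rfFit <- train(bbbDescr, logBBB, method = "rf", tuneLength = 5,
               trControl = trainControl(method = "cv", number = 5))
rfFit$bestTune  # the winning mtry value
```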
If you are not sure what role method plays when you use index, why not apply all the methods and compare the results? It is a blind method of comparison, but it can give you some intuition.
methods <- c('boot', 'boot632', 'cv',
'repeatedcv', 'LOOCV', 'LGOCV')
I create my index:
n <- 100
tmp <- createDataPartition(logBBB, p = .8, times = n)
I apply trainControl for each method in my list, and I remove index from the result since it is common to all my methods.
ll <- lapply(methods, function(x)
  trainControl(method = x, index = tmp))
ll <- sapply(ll, '[<-', 'index', NULL)  # drop the shared index element
Hence my ll is:
[,1] [,2] [,3] [,4] [,5] [,6]
method "boot" "boot632" "cv" "repeatedcv" "LOOCV" "LGOCV"
number 25 25 10 10 25 25
repeats 25 25 1 1 25 25
verboseIter FALSE FALSE FALSE FALSE FALSE FALSE
returnData TRUE TRUE TRUE TRUE TRUE TRUE
returnResamp "final" "final" "final" "final" "final" "final"
savePredictions FALSE FALSE FALSE FALSE FALSE FALSE
p 0.75 0.75 0.75 0.75 0.75 0.75
classProbs FALSE FALSE FALSE FALSE FALSE FALSE
summaryFunction ? ? ? ? ? ?
selectionFunction "best" "best" "best" "best" "best" "best"
preProcOptions List,3 List,3 List,3 List,3 List,3 List,3
custom NULL NULL NULL NULL NULL NULL
timingSamps 0 0 0 0 0 0
predictionBounds Logical,2 Logical,2 Logical,2 Logical,2 Logical,2 Logical,2
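To compare more than just the control objects, one could also fit the model under each control and compare the resampling distributions with caret's resamples() helper (a sketch reusing methods and tmp from above; this can be slow, and some methods, e.g. LOOCV, may not be directly comparable):

```r
# Fit the same model under each resampling method, sharing the same index
fits <- lapply(methods, function(m)
  train(bbbDescr, logBBB, "ctree",
        trControl = trainControl(method = m, index = tmp)))
names(fits) <- methods

# Summarise resampled RMSE / R-squared across methods
summary(resamples(fits))
```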