
CARET. Relationship between data splitting and trainControl

I have carefully read the CARET documentation at http://caret.r-forge.r-project.org/training.html and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still a bit confused about the relationship between two arguments to trainControl:

method 
index

and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).

To better frame my questions, let me use the following example from the documentation:

data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB,p = .8, times = 100)
trControl = trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
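For reference, here is a minimal sketch of what the object passed as index looks like in this example. It assumes the caret package is installed; the names n and held_out are introduced only for illustration. Each element of tmp holds the row numbers used for training in one resample, and the rows left out of an element are the ones available for evaluation.

```r
library(caret)

data(BloodBrain)   # loads bbbDescr (predictors) and logBBB (outcome)
set.seed(1)
n <- length(logBBB)

# 100 random 80/20 splits; each list element holds the training rows of one split
tmp <- createDataPartition(logBBB, p = .8, times = 100)

length(tmp)           # number of resamples
range(lengths(tmp))   # training-set sizes, roughly 80% of n

# Rows not listed in the first element form the held-out set of that resample
held_out <- setdiff(seq_len(n), tmp[[1]])
length(held_out)
```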

My questions are:

  1. If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.

  2. What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?

  3. How can I do stratified k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?

tmp <- createFolds(logBBB, k=10, list=TRUE,  times = 100)
trControl = trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree",trControl=trControl)
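As a rough sketch for questions 2 and 3, the four splitting helpers can be compared side by side. This assumes the caret package is installed; summarise_index, cv_ctrl and rcv_ctrl are hypothetical names used only for illustration. According to the caret documentation, createDataPartition and createFolds sample within groups of the outcome (quantile groups for a numeric outcome), while createResample draws plain bootstrap samples with replacement.

```r
library(caret)

data(BloodBrain)   # loads bbbDescr and logBBB
set.seed(1)
n <- length(logBBB)

part  <- createDataPartition(logBBB, p = 0.8, times = 3)   # stratified splits, no replacement
boot  <- createResample(logBBB, times = 3)                 # bootstrap samples, with replacement
folds <- createFolds(logBBB, k = 10, returnTrain = TRUE)   # one stratified 10-fold split
multi <- createMultiFolds(logBBB, k = 10, times = 5)       # 10-fold split repeated 5 times

# Hypothetical helper: size and duplication summary of a list of training indices
summarise_index <- function(idx) {
  c(resamples    = length(idx),
    min_size     = min(lengths(idx)),
    max_size     = max(lengths(idx)),
    repeats_rows = any(vapply(idx, anyDuplicated, integer(1)) > 0))
}

rbind(createDataPartition = summarise_index(part),
      createResample      = summarise_index(boot),
      createFolds         = summarise_index(folds),
      createMultiFolds    = summarise_index(multi))

# Training-row lists like these are what the index argument expects
cv_ctrl  <- trainControl(method = "cv", index = folds)
rcv_ctrl <- trainControl(method = "repeatedcv", number = 10, repeats = 5,
                         index = multi)
```

Two details matter here. As far as I know, createFolds has no times argument, so repeated folds come from createMultiFolds instead. Also, createFolds returns the held-out rows by default, so returnTrain = TRUE is needed before passing its result as index, which expects training rows.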
Amelio Vazquez-Reina asked Feb 19 '13 22:02



1 Answer

If you are not sure what role method plays when you use index, why not apply all the methods and compare the results? It is a blind way of comparing, but it can give you some intuition.

  methods <- c('boot', 'boot632', 'cv', 
               'repeatedcv', 'LOOCV', 'LGOCV')

I create my index:

  n <- 100
  tmp <- createDataPartition(logBBB,p = .8, times = n)

I apply trainControl to each method in my list, and I remove index from the results since it is common to all of them.

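# Build one trainControl object per method, all sharing the same index
ll <- lapply(methods, function(x)
         trainControl(method = x, index = tmp))
# Drop the shared index element so the remaining fields print as a matrix
ll <- sapply(ll, '[<-', 'index', NULL)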

Hence my ll is as follows. Only the method, number and repeats rows differ between the columns:

                 [,1]      [,2]      [,3]      [,4]         [,5]      [,6]     
method            "boot"    "boot632" "cv"      "repeatedcv" "LOOCV"   "LGOCV"  
number            25        25        10        10           25        25       
repeats           25        25        1         1            25        25       
verboseIter       FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
returnData        TRUE      TRUE      TRUE      TRUE         TRUE      TRUE     
returnResamp      "final"   "final"   "final"   "final"      "final"   "final"  
savePredictions   FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
p                 0.75      0.75      0.75      0.75         0.75      0.75     
classProbs        FALSE     FALSE     FALSE     FALSE        FALSE     FALSE    
summaryFunction   ?         ?         ?         ?            ?         ?        
selectionFunction "best"    "best"    "best"    "best"       "best"    "best"   
preProcOptions    List,3    List,3    List,3    List,3       List,3    List,3   
custom            NULL      NULL      NULL      NULL         NULL      NULL     
timingSamps       0         0         0         0            0         0        
predictionBounds  Logical,2 Logical,2 Logical,2 Logical,2    Logical,2 Logical,2
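The same comparison can be made programmatically instead of by reading the table. The following is a sketch under the assumption that caret is installed; ctrls, same_index and differs are hypothetical names, and the exact set of differing fields may vary between caret versions. It checks that trainControl stores the supplied index unchanged for every method and lists the fields whose values depend on the method.

```r
library(caret)

data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)

methods <- c('boot', 'boot632', 'cv', 'repeatedcv', 'LOOCV', 'LGOCV')
ctrls <- lapply(methods, function(m) trainControl(method = m, index = tmp))
names(ctrls) <- methods

# Is the supplied index kept as-is, whatever the method?
same_index <- vapply(ctrls, function(ctrl) identical(ctrl$index, tmp), logical(1))
same_index

# Which fields differ from the first control object in at least one method?
fields <- names(ctrls[[1]])
differs <- vapply(fields, function(f) {
  !all(vapply(ctrls, function(ctrl) identical(ctrl[[f]], ctrls[[1]][[f]]),
              logical(1)))
}, logical(1))
names(differs)[differs]
```

This only inspects the control objects. It does not show how train uses method during fitting once index is fixed.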
agstudy answered Sep 25 '22 01:09