I have carefully read the caret documentation at http://caret.r-forge.r-project.org/training.html, and the vignettes, and everything is quite clear (the examples on the website help a lot!), but I am still confused about the relationship between two arguments to trainControl:

method
index

and about the interplay between trainControl and the data splitting functions in caret (e.g. createDataPartition, createResample, createFolds and createMultiFolds).
To better frame my questions, let me use the following example from the documentation:
library(caret)
data(BloodBrain)
set.seed(1)
tmp <- createDataPartition(logBBB, p = .8, times = 100)
trControl <- trainControl(method = "LGOCV", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
My questions are:

1) If I use createDataPartition (which I assume does stratified bootstrapping), as in the above example, and I pass the result as index to trainControl, do I need to use LGOCV as the method in my call to trainControl? If I use another one (e.g. cv), what difference would it make? In my head, once you fix index, you are essentially choosing the type of cross-validation, so I am not sure what role method plays if you use index.
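One hedged way to probe question 1 (a sketch, assuming the caret package is installed): when index is supplied, the resamples are taken from index, so method no longer decides which rows are held out; it can still affect how results are computed and summarised (for example, boot632 blends in the apparent error rate).

```r
library(caret)
data(BloodBrain)
set.seed(1)

# 100 stratified 80/20 splits; trainControl will use exactly these
tmp <- createDataPartition(logBBB, p = .8, times = 100)

# Two controls sharing the same index but differing in method
ctrl1 <- trainControl(method = "LGOCV", index = tmp)
ctrl2 <- trainControl(method = "cv",    index = tmp)

# Both carry the same 100 hold-out splits, regardless of method
length(ctrl1$index)  # 100
length(ctrl2$index)  # 100
```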
2) What is the difference between createDataPartition and createResample? Is it that createDataPartition does stratified bootstrapping, while createResample doesn't?
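A hedged sketch of the difference (assuming the caret package): createDataPartition draws a stratified split without replacement (each resample holds a fraction p of the rows, no duplicates), while createResample draws a bootstrap sample with replacement (same size as the data, duplicates expected).

```r
library(caret)
set.seed(1)
y <- rnorm(100)

part <- createDataPartition(y, p = .8, times = 1)[[1]]
boot <- createResample(y, times = 1)[[1]]

length(part)           # roughly 80: an 80% split
any(duplicated(part))  # FALSE: sampled without replacement
length(boot)           # 100: same size as the data
any(duplicated(boot))  # TRUE (almost surely): sampled with replacement
```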
3) How can I do stratified k-fold (e.g. 10-fold) cross-validation using caret? Would the following do it?
tmp <- createFolds(logBBB, k = 10, list = TRUE, returnTrain = TRUE)
trControl <- trainControl(method = "cv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
(Note: createFolds has no times argument, and index expects the training rows of each resample, hence returnTrain = TRUE.)
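If repeated stratified folds are wanted (e.g. 10-fold repeated 100 times), createMultiFolds is the matching helper; a sketch, assuming the caret package (createMultiFolds stratifies on the outcome and already returns training-row indices):

```r
library(caret)
data(BloodBrain)
set.seed(1)

# 10 stratified folds, repeated 100 times; elements are training-row indices
tmp <- createMultiFolds(logBBB, k = 10, times = 100)
trControl <- trainControl(method = "repeatedcv", index = tmp)
ctreeFit <- train(bbbDescr, logBBB, "ctree", trControl = trControl)
```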
You can use the trainControl function to specify a number of parameters (including sampling parameters) for your model. The object output by trainControl is then passed as an argument to train.
Data splitting divides the data into two or more subsets. Typically, with a two-part split, one part is used to train the model and the other to test (evaluate) it. Data splitting is an important aspect of data science, particularly for building models from data.
tuneLength lets the system tune the algorithm automatically: it specifies how many different values to try for each tuning parameter, for example mtry for randomForest. With tuneLength = 5, train tries 5 different mtry values and finds the optimal mtry based on those 5.
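A brief sketch of tuneLength in use (hedged; assumes the caret and randomForest packages are installed, and 5-fold CV is chosen here only to keep it quick):

```r
library(caret)
data(BloodBrain)
set.seed(1)

# Try 5 candidate mtry values; train picks the best by resampled RMSE
rfFit <- train(bbbDescr, logBBB, method = "rf", tuneLength = 5,
               trControl = trainControl(method = "cv", number = 5))
rfFit$bestTune  # the winning mtry value
```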
If you are not sure what role method plays when you use index, why not apply all the methods and compare the results? It is a blind method of comparison, but it can give you some intuition.
methods <- c('boot', 'boot632', 'cv',
'repeatedcv', 'LOOCV', 'LGOCV')
I create my index:
n <- 100
tmp <- createDataPartition(logBBB, p = .8, times = n)
I apply trainControl for each method in my list, and I remove index from the result since it is common to all my methods.
ll <- lapply(methods, function(x)
  trainControl(method = x, index = tmp))
ll <- sapply(ll, '[<-', 'index', NULL)  # drop the shared index element
Hence my ll is:
[,1] [,2] [,3] [,4] [,5] [,6]
method "boot" "boot632" "cv" "repeatedcv" "LOOCV" "LGOCV"
number 25 25 10 10 25 25
repeats 25 25 1 1 25 25
verboseIter FALSE FALSE FALSE FALSE FALSE FALSE
returnData TRUE TRUE TRUE TRUE TRUE TRUE
returnResamp "final" "final" "final" "final" "final" "final"
savePredictions FALSE FALSE FALSE FALSE FALSE FALSE
p 0.75 0.75 0.75 0.75 0.75 0.75
classProbs FALSE FALSE FALSE FALSE FALSE FALSE
summaryFunction ? ? ? ? ? ?
selectionFunction "best" "best" "best" "best" "best" "best"
preProcOptions List,3 List,3 List,3 List,3 List,3 List,3
custom NULL NULL NULL NULL NULL NULL
timingSamps 0 0 0 0 0 0
predictionBounds Logical,2 Logical,2 Logical,2 Logical,2 Logical,2 Logical,2
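To compare more than just the control objects, one could also fit the model under each control and compare the resampling distributions with caret's resamples() helper (a sketch reusing methods and tmp from above; this can be slow, and some methods, e.g. LOOCV, may not be directly comparable):

```r
# Fit the same model under each resampling method, sharing the same index
fits <- lapply(methods, function(m)
  train(bbbDescr, logBBB, "ctree",
        trControl = trainControl(method = m, index = tmp)))
names(fits) <- methods

# Summarise resampled RMSE / R-squared across methods
summary(resamples(fits))
```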