How does cross-validation for train within rfe work? (R caret)

I have a question regarding the rfe function from the caret library. The caret homepage describes the RFE-with-resampling algorithm that rfe implements.

For this example I am using the rfe function with 3-fold cross-validation and the train function with a linear SVM and 5-fold cross-validation.

library(kernlab)
library(caret)
data(iris)

# parameters for the tune function, used for fitting the svm
trControl <- trainControl(method = "cv", number = 5)

# parameters for the RFE function
rfeControl <- rfeControl(functions = caretFuncs, method = "cv",
                         number = 3, verbose = FALSE)

rf1 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl, trControl = trControl, method = "svmLinear")
From the algorithm above I assumed that rfe would work with two nested cross-validations (see the sketch after this list):
    1. rfe splits the data (150 samples) into 3 folds
    2. the train function would be run on each training set (~100 samples) with 5-fold cross-validation to tune the model parameters, with subsequent RFE
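A minimal sketch of that expected nesting, built directly with caret's createFolds (this check is my own illustration, independent of the rfe call): an outer 3-fold split leaves ~100 training rows, and a 5-fold split of one such training set leaves ~80 rows per inner training fold.

library(caret)
data(iris)

# outer resampling: 3-fold CV on all 150 rows, keeping the training indices
outer <- createFolds(iris$Species, k = 3, returnTrain = TRUE)
sapply(outer, length)   # roughly 100 training rows per outer fold

# inner resampling: 5-fold CV on one outer training set (~100 rows)
inner <- createFolds(iris$Species[outer$Fold1], k = 5, returnTrain = TRUE)
sapply(inner, length)   # roughly 80 training rows per inner fold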

What confuses me is what I see when I take a look at the results of the rfe function:

> lapply(rf1$control$index, length)
$Fold1
[1] 100
$Fold2
[1] 101
$Fold3
[1] 99

> lapply(rf1$fit$control$index, length)
$Fold1
[1] 120
$Fold2
[1] 120
$Fold3
[1] 120
$Fold4
[1] 120
$Fold5
[1] 120

From this it appears that the training sets of the 5-fold CV contain 120 samples each, whereas I would expect 80 (4/5 of the ~100 outer training samples).

So it would be great if someone could clarify how rfe and train work together.

Cheers

> sessionInfo()
R version 2.15.1 (2012-06-22)
Platform: i386-apple-darwin9.8.0/i386 (32-bit)

locale:
[1] C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] pROC_1.5.4      e1071_1.6-1     class_7.3-5     caret_5.15-048 
 [5] foreach_1.4.0   cluster_1.14.3  plyr_1.7.1      reshape2_1.2.1 
 [9] lattice_0.20-10 kernlab_0.9-15 

loaded via a namespace (and not attached):
 [1] codetools_0.2-8 compiler_2.15.1 grid_2.15.1     iterators_1.0.6
 [5] stringr_0.6.1   tools_2.15.1   
asked Jan 22 '13 by Fabian_G


1 Answer

The problem here is that rf1$fit$control$index does not hold what we think it does.

To understand this, I had to look into the code. What happens there is the following:

When you call rfe, the whole data set is passed to nominalRfeWorkflow.

In nominalRfeWorkflow, the data is split into training and test sets according to rfeControl (in our example, 3 times following the 3-fold CV scheme) and passed to rfeIter. These splits can be found in the result under rf1$control$index.

In rfeIter, the ~100 training samples (in our example) are used to find the final variables (the output of that function). As I understand it, the ~50 test samples (in our example) are used to calculate the performance of the different variable subsets, but this is only stored as external performance and not used to select the final variables. For that selection, the performance estimates from the 5-fold cross-validation inside train are used. We cannot find those inner indices in the final result returned by rfe, however. If we really needed them, we would have to fetch them from fitObject$control$index in rfeIter, return them to nominalRfeWorkflow, then to rfe, and from there into the resulting rfe-class object.
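If you do need those inner indices, one workaround is to wrap caretFuncs$fit so that every inner train object's resampling indices are recorded on the side. This is my own sketch, not anything rfe offers; innerIndex and myFuncs are hypothetical names, and trControl is the 5-fold control from the question:

innerIndex <- list()   # hypothetical side channel for the inner CV indices

myFuncs <- caretFuncs
myFuncs$fit <- function(x, y, first, last, ...) {
  # delegate to the original fit (which calls train) and record the
  # inner resampling indices before returning the model unchanged
  fit <- caretFuncs$fit(x, y, first, last, ...)
  innerIndex[[length(innerIndex) + 1]] <<- fit$control$index
  fit
}

rfeControl2 <- rfeControl(functions = myFuncs, method = "cv",
                          number = 3, verbose = FALSE)
rf2 <- rfe(as.matrix(iris[, 1:4]), as.factor(iris[, 5]), sizes = c(2, 3),
           rfeControl = rfeControl2, trControl = trControl, method = "svmLinear")

# each inner training fold now shows ~80 rows (4/5 of ~100); the very last
# entry comes from the final fit on all 150 rows and shows 120 instead
lapply(innerIndex[[1]], length)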

So what is stored in lapply(rf1$fit$control$index, length)? Once rfe has found the best variables, the final model fit is created with those variables and the full reference data (150 samples). rf1$fit is created in rfe as follows:

fit <- rfeControl$functions$fit(x[, bestVar, drop = FALSE], y, first = FALSE, last = TRUE, ...)

This again runs the train function, doing a final cross-validation with the full reference data, the final feature set, and the trControl passed via the ellipsis (...). Since our trControl specifies 5-fold CV, it is correct that lapply(rf1$fit$control$index, length) reports 120, because 150 * 4/5 = 120.
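As a quick sanity check of that arithmetic (my own illustration using createFolds, not code from rfe itself): 5-fold CV on all 150 rows leaves exactly 120 training rows per fold, matching the output above.

folds <- createFolds(iris$Species, k = 5, returnTrain = TRUE)
sapply(folds, length)   # 120 for every fold, i.e. 150 - 150/5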

answered Oct 16 '22 by ben