How to specify a validation holdout set to caret

I really like using caret for at least the early stages of modeling, especially for its really easy-to-use resampling methods. However, I'm working on a model where the training set has a fair number of cases added via semi-supervised self-training, and my cross-validation results are really skewed because of it. My solution is to measure model performance with a separate validation set, but I can't see a way to use a validation set directly within caret - am I missing something, or is this just not supported? I know that I can write my own wrappers to do what caret would normally do for me, but it would be really nice if there is a work-around that doesn't require that.

Here is a trivial example of what I am experiencing:

> library(caret)
> set.seed(1)
> 
> #training/validation sets
> i <- sample(150,50)
> train <- iris[-i,]
> valid <- iris[i,]
> 
> #make my model
> tc <- trainControl(method="cv")
> model.rf <- train(Species ~ ., data=train,method="rf",trControl=tc)
> 
> #model parameters are selected using CV results...
> model.rf
100 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validation (10 fold) 

Summary of sample sizes: 90, 90, 90, 89, 90, 92, ... 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa  Accuracy SD  Kappa SD
  2     0.971     0.956  0.0469       0.0717  
  3     0.971     0.956  0.0469       0.0717  
  4     0.971     0.956  0.0469       0.0717  

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 
> 
> #have to manually check validation set
> valid.pred <- predict(model.rf,valid)
> table(valid.pred,valid$Species)

valid.pred   setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         20         1
  virginica       0          2        10
> mean(valid.pred==valid$Species)
[1] 0.94

I originally thought I could do this by creating a custom summaryFunction() for a trainControl() object but I cannot see how to reference my model object to get predictions from the validation set (the documentation - http://caret.r-forge.r-project.org/training.html - lists only "data", "lev" and "model" as possible parameters). For example this clearly will not work:

tc$summaryFunction <- function(data, lev = NULL, model = NULL){
  data.frame(Accuracy=mean(predict(<model object>,valid)==valid$Species))
}
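
As far as I can tell, a summaryFunction only ever receives the resampled hold-out predictions through data (a data frame with obs and pred columns), plus the class levels and the method name, so a sketch of the most it can legitimately do looks like this (the function name here is just illustrative):

#computes accuracy from the resampled hold-out predictions passed in via `data`;
#there is no hook to the fitted model object or to an external validation set
accuracySummary <- function(data, lev = NULL, model = NULL){
  c(Accuracy = mean(data$pred == data$obs))
}
tc <- trainControl(method = "cv", summaryFunction = accuracySummary)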

EDIT: In an attempt to come up with a truly ugly fix, I've been looking to see if I can access the model object from the scope of another function, but I'm not even seeing the model stored anywhere. Hopefully there is some elegant solution that I'm not even coming close to seeing...

> tc$summaryFunction <- function(data, lev = NULL, model = NULL){
+   browser()
+   data.frame(Accuracy=mean(predict(model,valid)==valid$Species))
+ }
> train(Species ~ ., data=train,method="rf",trControl=tc)
note: only 1 unique complexity parameters in default grid. Truncating the grid to 1 .

Called from: trControl$summaryFunction(testOutput, classLevels, method)
Browse[1]> lapply(sys.frames(),function(x) ls(envi=x))
[[1]]
[1] "x"

[[2]]
 [1] "cons"      "contrasts" "data"      "form"      "m"         "na.action" "subset"   
 [8] "Terms"     "w"         "weights"   "x"         "xint"      "y"        

[[3]]
[1] "x"

[[4]]
 [1] "classLevels" "funcCall"    "maximize"    "method"      "metric"      "modelInfo"  
 [7] "modelType"   "paramCols"   "ppMethods"   "preProcess"  "startTime"   "testOutput" 
[13] "trainData"   "trainInfo"   "trControl"   "tuneGrid"    "tuneLength"  "weights"    
[19] "x"           "y"          

[[5]]
[1] "data"  "lev"   "model"

David


2 Answers

Take a look at trainControl. There are now options to directly specify the rows of the data that are used to fit the model (the index argument) and the rows that should be used to compute the hold-out estimates (indexOut). I think that does what you are looking for.
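
For example, a minimal sketch using the same iris split as in the question (argument names as in recent versions of trainControl):

library(caret)
set.seed(1)
i <- sample(150, 50)                                  #validation row indices
tc <- trainControl(method = "cv",
                   index = list(Fold1 = (1:150)[-i]), #rows used to fit each candidate model
                   indexOut = list(Fold1 = i),        #rows used for the hold-out estimates
                   savePredictions = TRUE)
model.rf <- train(Species ~ ., data = iris, method = "rf", trControl = tc)

With a single element in index/indexOut, the resampling collapses to exactly one training/validation split, so the performance that train reports comes from the held-out rows only.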

Max


topepo


I think I may have found a work-around for this, but I'm not 100% sure that it is doing what I want, and I am still hoping that someone comes up with something a bit more elegant. Anyway, I realized that it probably makes the most sense to include the validation set inside my training set and just define the resampling indices so that performance is measured only on the validation data. I think this should do the trick for the example above:

> library(caret)
> set.seed(1)
> 
> #training/validation set indices
> i <- sample(150,50) #note - I no longer need to explicitly create train/validation sets
> 
> #explicitly define the cross-validation indices to be those from the validation set
> tc <- trainControl(method="cv",number=1,index=list(Fold1=(1:150)[-i]),savePredictions=T)
> (model.rf <- train(Species ~ ., data=iris,method="rf",trControl=tc))
150 samples
  4 predictors
  3 classes: 'setosa', 'versicolor', 'virginica' 

No pre-processing
Resampling: Cross-Validation (1 fold) 

Summary of sample sizes: 100 

Resampling results across tuning parameters:

  mtry  Accuracy  Kappa
  2     0.94      0.907
  3     0.94      0.907
  4     0.94      0.907

Accuracy was used to select the optimal model using  the largest value.
The final value used for the model was mtry = 2. 
> 
> #I think this worked because the resampling indices line up
> all(sort(unique(model.rf$pred$rowIndex)) == sort(i))
[1] TRUE
> #the contingency table from above (transposed here) also indicates that this works
> table(model.rf$pred[model.rf$pred$.mtry==model.rf$bestTune[[1]],c("obs","pred")])
            pred
obs          setosa versicolor virginica
  setosa         17          0         0
  versicolor      0         20         2
  virginica       0          1        10
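
For what it's worth, the hold-out accuracy can also be read straight from the saved predictions (a sketch; the tuning parameter column is .mtry in the output above, though newer caret versions drop the leading dot):

#hold-out predictions for the selected tuning value only
best.pred <- model.rf$pred[model.rf$pred$.mtry == model.rf$bestTune[[1]], ]
mean(best.pred$pred == best.pred$obs)  #should reproduce the 0.94 from the question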

David