r caretEnsemble warning: indexes not defined in trControl

I have some r/caret code to fit several cross-validated models to some data, but I'm getting a warning message that I'm having trouble finding any information on. Is this something I should be concerned about?

library(datasets)
library(caret)
library(caretEnsemble)

# load data
data("iris")

# establish cross-validation structure
set.seed(32)
trainControl <- trainControl(method="repeatedcv", number=5, repeats=3, savePredictions=TRUE, search="random")

# fit several (cross-validated) models 
algorithmList <- c('lda',         # Linear Discriminant Analysis 
                   'rpart' ,      # Classification and Regression Trees
                   'svmRadial')   # SVM with RBF Kernel

models <- caretList(Species~., data=iris, trControl=trainControl, methodList=algorithmList)

log output:

Warning messages:
1: In trControlCheck(x = trControl, y = target) :
  x$savePredictions == TRUE is depreciated. Setting to 'final' instead.
2: In trControlCheck(x = trControl, y = target) :
  indexes not defined in trControl.  Attempting to set them ourselves, so each model in the ensemble will have the same resampling indexes.

...I thought my trainControl object, defining a cross-validation structure (of 3x 5-fold cross-validation) would generate a set of indices for the cv splits. So I'm confused why I would get this message.

Max Power asked Jul 18 '17 01:07


1 Answer

By default, trainControl does not generate the resampling indices for you; it is a way of passing the same set of parameters to each model you train.
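As a minimal sketch of this point, assuming only that caret is installed, a freshly built control object carries no resampling indexes until you supply them. The name ctrl is purely illustrative.

```r
library(caret)

# Same resampling settings as in the question, but without an explicit index
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 3)

# The index slot is empty: the folds are only drawn later, when a model is fit
index_is_missing <- is.null(ctrl$index)
print(index_is_missing)
```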

Searching the caretEnsemble GitHub issues for this warning turns up an issue with the following explanation:

You need to make sure that every model is fit with the EXACT same resampling folds. caretEnsemble builds the ensemble by merging together the test sets for each cross-validation fold, and you will get incorrect results if each fold has different observations in it.

Before you fit your models, you need to construct a trainControl object, and manually set the indexes in that object.

E.g. myControl <- trainControl(index=createFolds(y, 10)).

We are working on an interface to caretEnsemble that handles constructing the resampling strategy for you and then fitting multiple models using those resamples, but it is not yet finished.

To reiterate, that check is there for a reason. You need to set the index argument in trainControl, and pass the EXACT SAME indexes to each model you wish to ensemble.

In other words, when you specify number = 5 and repeats = 3 without an index, the models do not receive a predetermined assignment of samples to folds; each one generates its own folds independently.
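To illustrate why independently drawn folds cannot be relied on to match, here is a small sketch that draws the 3x 5-fold resamples twice by hand. It uses caret's createMultiFolds as a stand-in for what happens when no index is supplied; the seeds and the names draw_a, draw_b and draw_a_again are arbitrary choices for the example.

```r
library(caret)

data(iris)

# Two independent draws of 3 x 5-fold resamples, as two separately
# fitted models might obtain them when no index is supplied
set.seed(1)
draw_a <- createMultiFolds(iris$Species, k = 5, times = 3)
set.seed(2)
draw_b <- createMultiFolds(iris$Species, k = 5, times = 3)

# Resetting the seed reproduces the first draw exactly
set.seed(1)
draw_a_again <- createMultiFolds(iris$Species, k = 5, times = 3)

folds_differ    <- !identical(draw_a, draw_b)
folds_reproduce <- identical(draw_a, draw_a_again)
print(c(differ = folds_differ, reproduce = folds_reproduce))
```

Fixing the folds once and handing the same list to every model removes the dependence on the random number state at fit time.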

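Therefore, to ensure that the models agree on which samples belong to which folds, specify index explicitly in your trainControl object. Note that index expects the rows used for training in each resample, whereas createFolds returns the held-out rows by default. For 3x repeated 5-fold cross-validation, createMultiFolds(iris$Species, k = 5, times = 3) returns the training rows directly:

# new trainControl object with index specified
set.seed(32)  # make the folds reproducible
trainControl <- trainControl(method = "repeatedcv",
                             number = 5,
                             repeats = 3,
                             index = createMultiFolds(iris$Species, k = 5, times = 3),
                             savePredictions = "all",
                             search = "random")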
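
One way to check that two models really were resampled on the same folds is to compare the index stored in each fitted object. The sketch below uses plain caret::train on two of the models from the question rather than caretList, and builds the index with createMultiFolds on the assumption that index should hold training rows. The names shared_index, ctrl, fit_lda and fit_rpart are illustrative.

```r
library(caret)

data(iris)

# Assumption: draw the 3 x 5-fold resamples once, as training-row indices
set.seed(32)
shared_index <- createMultiFolds(iris$Species, k = 5, times = 3)

ctrl <- trainControl(method = "repeatedcv",
                     number = 5,
                     repeats = 3,
                     index = shared_index,
                     savePredictions = "final")

# Fit two different models with the same control object
fit_lda   <- train(Species ~ ., data = iris, method = "lda",   trControl = ctrl)
fit_rpart <- train(Species ~ ., data = iris, method = "rpart", trControl = ctrl)

# Both fits should carry the same resampling index
same_folds <- identical(fit_lda$control$index, fit_rpart$control$index)
print(same_folds)
```

If same_folds comes back FALSE for your own models, they were not resampled on matching folds and should not be ensembled together.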

zacdav answered Oct 03 '22 02:10