I'm trying to use SMOTE in R within the trainControl function in caret. Following the author's example I do as follows:
#first, create an imbalanced data set
library(caret)
set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
imbal_test <- twoClassSim(10000, intercept = -20, linearVars = 20)
table(imbal_train$Class)
Class1 Class2
9411 589
I want to use the SMOTE algorithm to oversample my minority class. However, this has to be done carefully. For instance, we shouldn't oversample before doing cross-validation, because synthetic records derived from the held-out rows would then leak into the training folds and give an optimistically biased estimate of the generalization error.
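To make the ordering concrete, here is a minimal sketch of "split first, resample only the training rows". It is not caret's internal code. It uses caret's upSample() as a simple stand-in for SMOTE, and the smaller data set and the names dat, train_bal, and test_set are only for illustration.

```r
library(caret)

set.seed(2969)
dat <- twoClassSim(2000, intercept = -20, linearVars = 20)

# Split BEFORE any resampling: each element holds training row numbers
folds <- createFolds(dat$Class, k = 5, list = TRUE, returnTrain = TRUE)
train_rows <- folds[[1]]
test_rows <- setdiff(seq_len(nrow(dat)), train_rows)

# Resample only the training rows (upSample stands in for SMOTE here)
predictors <- dat[train_rows, names(dat) != "Class"]
train_bal <- upSample(x = predictors, y = dat$Class[train_rows],
                      yname = "Class")

# The held-out fold consists of untouched original rows
test_set <- dat[test_rows, ]

table(train_bal$Class)  # classes have equal counts after upsampling
table(test_set$Class)   # still imbalanced, as it should be
```

Doing the resampling on the full data set first and splitting afterwards would let copies (or, with SMOTE, interpolations) of held-out rows appear in the training folds.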
#create my folds (5 in this case)
folds <- createFolds(factor(imbal_train$Class), k = 5, list = TRUE, returnTrain = TRUE)
#trainControl to set up my training phase.
ctrl <- trainControl(method = "cv", index = folds,
                     classProbs = TRUE,
                     summaryFunction = twoClassSummary,
                     savePredictions = "all",
                     ## new option here:
                     sampling = "smote")
#train the model
set.seed(5627)
smote_inside <- train(Class ~ ., data = imbal_train,
                      method = "treebag",
                      nbagg = 50,
                      metric = "ROC",
                      trControl = ctrl)
It runs without error. I now want to see the training and testing sets used in each iteration. I need to make sure that, before the training folds were oversampled, one fold was held out and no synthetic records were created inside it.
Looking into the object returned by train, I could see that smote_inside$control may have some information. Concretely, it has index and indexOut: these are the row indexes for training and testing in each CV iteration. However, when I do:
lista <- smote_inside$control
dd <- imbal_train[lista$index$Fold1, ] #training data first cv iteration
table(dd$Class)
Class1 Class2
7529 471
You can see that it is still imbalanced. SMOTE is supposed to create some synthetic records from the minority class. Maybe this information is saved in another place?
Questions:
1. How can I see the new training records that were created using SMOTE to balance the data?
2. How can I be sure that the held-out (testing) fold wasn't contaminated by the oversampling?
3. Where can I find what caret is doing with SMOTE? Pointers to the source code would help.
Some answers:
1. It does not retain that information.
2. It is designed not to contaminate the holdout data. If you want proof (beyond what is shown in the link that you reference), look at createModel to see how it does the sampling and predictionFunction for how the data are handled prior to prediction.
3. The package sources are widely available (for example on CRAN and GitHub). The two functions above (along with probFunction) do the work.
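Since train does not keep the resampled data, one way to get a feel for what SMOTE produces is to re-run the sampling step by hand on one fold's training rows. The sketch below assumes that getSamplingInfo() returns caret's built-in sampling functions as function(x, y) returning list(x, y), and that the SMOTE backend package is installed (as far as I know, themis in recent caret releases and DMwR in older ones; treat that as an assumption). The synthetic rows will not be the exact ones generated during training, because the random state differs and, inside train with the formula interface, the sampler sees a model matrix rather than a data frame.

```r
library(caret)

set.seed(2969)
imbal_train <- twoClassSim(10000, intercept = -20, linearVars = 20)
folds <- createFolds(imbal_train$Class, k = 5, list = TRUE, returnTrain = TRUE)

# Look up the function behind sampling = "smote"
smote_fun <- getSamplingInfo("smote", regex = FALSE)[[1]]

# Training rows of the first CV iteration (original, imbalanced data)
train_rows <- folds$Fold1
x <- imbal_train[train_rows, names(imbal_train) != "Class"]
y <- imbal_train$Class[train_rows]

# Apply SMOTE to the training rows only
set.seed(1)
resampled <- smote_fun(x, y)

table(y)            # class counts before sampling
table(resampled$y)  # class counts after sampling

# The held-out rows are simply the original rows not used for training
holdout_rows <- setdiff(seq_len(nrow(imbal_train)), train_rows)
length(intersect(train_rows, holdout_rows))  # 0: no overlap
```

This also explains why imbal_train[lista$index$Fold1, ] still looks imbalanced: the stored indexes only point at rows of the original data, and the sampling is applied to those rows later, inside each resampling iteration.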