I want to use the out-of-fold predictions from a caret model to train a second-stage model that includes some of the original predictors. I can collect the out-of-fold predictions as follows:
#Load Data
set.seed(1)
library(caret)
library(mlbench)
data(BostonHousing)
#Build Model (see ?train)
# 10-fold CV, matching the ten-element fold list shown further down
rpartFit <- train(medv ~ . + rm:lstat, data = BostonHousing, method = "rpart",
                  trControl = trainControl(method = "cv", number = 10,
                                           savePredictions = TRUE))
#Collect out-of-fold predictions
out_of_fold <- rpartFit$pred
bestCP <- rpartFit$bestTune[,'.cp']
out_of_fold <- out_of_fold[out_of_fold$.cp==bestCP,]
This works, but the predictions are in the wrong order:
> all.equal(out_of_fold$obs, BostonHousing$medv)
[1] "Mean relative difference: 0.4521906"
I know the train object returns a list of the indexes that were used to train each fold:
> str(rpartFit$control$index)
List of 10
$ Fold01: int [1:457] 1 2 3 4 5 6 7 8 9 10 ...
$ Fold02: int [1:454] 2 3 4 8 10 11 12 13 14 15 ...
$ Fold03: int [1:457] 1 2 3 4 5 6 7 8 9 10 ...
$ Fold04: int [1:455] 1 2 3 5 6 7 8 9 10 11 ...
$ Fold05: int [1:455] 1 2 3 4 5 6 7 8 9 10 ...
$ Fold06: int [1:455] 1 2 3 4 5 6 7 8 9 10 ...
$ Fold07: int [1:457] 1 3 4 5 6 7 8 9 10 13 ...
$ Fold08: int [1:455] 1 2 4 5 6 7 9 11 12 14 ...
$ Fold09: int [1:455] 1 2 3 4 5 6 7 8 9 10 ...
$ Fold10: int [1:454] 1 2 3 4 5 6 7 8 9 10 ...
How can I use this information to put the observations in my out_of_fold object in the same order as the original BostonHousing dataset?
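One possible way to use the index list is sketched below in base R. It relies on an assumption that should be checked against real output: within each fold, the saved predictions appear in ascending order of the held-out row numbers. The objects `index` and `pred` are hypothetical stand-ins shaped like `rpartFit$control$index` and `rpartFit$pred`, and `restore_order` is an illustrative helper, not a caret function.

```r
# Hypothetical stand-ins for the caret objects, so the sketch runs on its own
set.seed(1)
n <- 20
y <- rnorm(n)
fold_id <- sample(rep(1:4, length.out = n))

# Training indexes per fold, shaped like rpartFit$control$index
index <- lapply(1:4, function(k) which(fold_id != k))
names(index) <- sprintf("Fold%d", 1:4)

# Mock prediction table shaped like rpartFit$pred: rows grouped by fold,
# each fold's rows in ascending held-out row order (the stated assumption)
pred <- do.call(rbind, lapply(names(index), function(f) {
  held_out <- setdiff(seq_len(n), index[[f]])
  data.frame(obs = y[held_out], Resample = f, stringsAsFactors = FALSE)
}))

# Illustrative helper: the held-out rows of a fold are the rows that are
# NOT in its training index; attach them and sort by original row number
restore_order <- function(pred, index, n) {
  pred$row <- NA_integer_
  for (f in names(index)) {
    held_out <- setdiff(seq_len(n), index[[f]])
    in_fold <- pred$Resample == f
    stopifnot(sum(in_fold) == length(held_out))
    pred$row[in_fold] <- held_out
  }
  pred[order(pred$row), ]
}

ordered <- restore_order(pred, index, n)
all.equal(ordered$obs, y)
```

On the real model, the same comparison against `BostonHousing$medv` is a quick way to confirm whether the ordering assumption holds for the installed caret version.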
An out-of-fold prediction is a prediction made by the model on held-out data during k-fold cross-validation. That is, out-of-fold predictions are the predictions made on the holdout sets during the resampling procedure. If performed correctly, there will be one prediction for each example in the training dataset.
k-fold cross-validation randomly divides the set of observations into k groups, or folds, of approximately equal size. Each sample is therefore used in the hold-out set exactly once and used to train the model k-1 times.
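The procedure described above can be sketched by hand in base R. This is a minimal illustration, not what caret does internally: it swaps in `lm` on the built-in `mtcars` data instead of `rpart` on BostonHousing, and the names `fold_id`, `oof`, and `stage2` are made up for the example.

```r
set.seed(1)
k <- 5
n <- nrow(mtcars)

# Randomly assign each row to one of k folds of roughly equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

# One slot per original row, so predictions stay in the original order
oof <- rep(NA_real_, n)

for (fold in seq_len(k)) {
  test_rows <- which(fold_id == fold)
  # Train on the other k-1 folds, predict only the held-out fold
  fit <- lm(mpg ~ wt + hp, data = mtcars[-test_rows, ])
  oof[test_rows] <- predict(fit, newdata = mtcars[test_rows, ])
}

# Every row has been held out exactly once, so nothing is left missing
stopifnot(!anyNA(oof))

# Second-stage model: out-of-fold predictions plus an original predictor
stacked <- cbind(mtcars, oof = oof)
stage2 <- lm(mpg ~ oof + qsec, data = stacked)
```

Writing each prediction into `oof[test_rows]` is what keeps the result aligned with the original rows, so no reordering step is needed afterwards.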
I'll add another column to the output that indicates the original row number for each sample in the next release (probably a month from now).
Max
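Once such a row-number column exists, reordering reduces to a single sort. The sketch below assumes the column is named `rowIndex`, as in later caret releases; check `names(rpartFit$pred)` on your installed version. The data frame is a made-up stand-in for `rpartFit$pred`, not real model output.

```r
# Made-up stand-in for rpartFit$pred with a row-number column
set.seed(1)
y <- c(10, 20, 30, 40, 50, 60)        # plays the role of BostonHousing$medv
shuffled <- sample(seq_along(y))      # predictions arrive in fold order

out_of_fold <- data.frame(
  pred     = y[shuffled] + 0.5,       # arbitrary fake predictions
  obs      = y[shuffled],
  rowIndex = shuffled                 # assumed column name
)

# Sort by original row number to line up with the source data
ordered <- out_of_fold[order(out_of_fold$rowIndex), ]
all.equal(ordered$obs, y)
```

With repeated cross-validation each row would appear once per repeat, so the table would need to be filtered to one repeat (or averaged per row) before sorting.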