Out-of-fold vs training error in caret

Using cross validation in model tuning, I get different error rates from caret::train's results object and calculating the error myself on its pred object. I'd like to understand why they differ, and ideally how to use out-of-fold error rates for model selection, plotting model performance, etc.

The pred object contains out-of-fold predictions. The docs are pretty clear that trainControl(..., savePredictions = "final") saves out-of-fold predictions for the best hyperparameter values: "an indicator of how much of the hold-out predictions for each resample should be saved... "final" saves the predictions for the optimal tuning parameters." (Keeping "all" predictions and then filtering to the best tuning values doesn't resolve the issue.)

The train docs say that the results object is "a data frame the training error rate..." I'm not sure what that means, but the values for the best row are consistently different from the metrics calculated on pred. Why do they differ and how can I make them line up?

d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#> Loading required package: lattice
#> Loading required package: ggplot2
m
#> Random Forest 
#> 
#> 50 samples
#>  2 predictor
#> 
#> No pre-processing
#> Resampling: Cross-Validated (4 fold) 
#> Summary of sample sizes: 38, 36, 38, 38 
#> Resampling results across tuning parameters:
#> 
#>   min.node.size  mtry  splitrule   RMSE       Rsquared   MAE      
#>   1              2     maxstat     0.5981673  0.6724245  0.4993722
#>   3              1     extratrees  0.5861116  0.7010012  0.4938035
#>   4              2     maxstat     0.6017491  0.6661093  0.4999057
#> 
#> RMSE was used to select the optimal model using the smallest value.
#> The final values used for the model were mtry = 1, splitrule =
#>  extratrees and min.node.size = 3.
MLmetrics::RMSE(m$pred$pred, m$pred$obs)
#> [1] 0.609202
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
#> [1] 0.642394

Created on 2018-04-09 by the reprex package (v0.2.0).

asked Apr 10 '18 00:04

mlevy

2 Answers

caret does not compute the cross-validation RMSE by pooling all out-of-fold predictions, as you do. It computes the RMSE within each fold and then averages those values across folds. Full example:

set.seed(1)
d <- data.frame(y = rnorm(50))
d$x1 <- rnorm(50, d$y)
d$x2 <- rnorm(50, d$y)
train_control <- caret::trainControl(method = "cv",
                                     number = 4,
                                     search = "random",
                                     savePredictions = "final")
set.seed(1)
m <- caret::train(x = d[, -1],
                  y = d$y,
                  method = "ranger",
                  trControl = train_control,
                  tuneLength = 3)
#output
Random Forest 

50 samples
 2 predictor

No pre-processing
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 37, 38, 37, 38 
Resampling results across tuning parameters:

  min.node.size  mtry  splitrule   RMSE       Rsquared   MAE      
   8             1     extratrees  0.6106390  0.4360609  0.4926629
  12             2     extratrees  0.6156636  0.4294237  0.4954481
  19             2     variance    0.6472539  0.3889372  0.5217369

RMSE was used to select the optimal model using the smallest value.
The final values used for the model were mtry = 1, splitrule = extratrees and min.node.size = 8.

The RMSE reported for the best model is 0.6106390.

Now calculate the RMSE for each fold and average:

library(dplyr)

# RMSE within each fold, then the mean across folds
m$pred %>%
  group_by(Resample) %>%
  summarise(rmse = caret::RMSE(pred, obs)) %>%
  pull(rmse) %>%
  mean()
#output
0.610639

# Same calculation using MLmetrics::RMSE instead of caret::RMSE
m$pred %>%
  group_by(Resample) %>%
  summarise(rmse = MLmetrics::RMSE(pred, obs)) %>%
  pull(rmse) %>%
  mean()
#output
0.610639
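To see why pooling and per-fold averaging need not agree, here is a minimal base-R sketch with made-up numbers. The data frame `oof` and the helper `rmse` are hypothetical and only mimic the shape of `m$pred`; nothing here calls caret.

```r
# Toy out-of-fold predictions; the values are invented for illustration only.
oof <- data.frame(
  Resample = rep(c("Fold1", "Fold2"), each = 3),
  obs      = c(1.0, 2.0, 3.0, 4.0, 5.0, 6.0),
  pred     = c(1.1, 1.9, 3.2, 3.0, 5.5, 7.0)
)

# Hypothetical helper: root mean squared error
rmse <- function(pred, obs) sqrt(mean((pred - obs)^2))

# Pooled: a single RMSE over all held-out rows (what the question computes)
pooled <- rmse(oof$pred, oof$obs)

# Per fold, then averaged (what the results table reports)
by_fold  <- sapply(split(oof, oof$Resample),
                   function(f) rmse(f$pred, f$obs))
averaged <- mean(by_fold)

c(pooled = pooled, averaged = averaged)
```

Because the square root is taken before averaging in one case and after in the other, the two numbers generally differ unless every fold has the same error. In a fitted train object the per-fold metrics for the selected model should also be available in m$resample, so mean(m$resample$RMSE) is another way to reproduce the reported value. The Rsquared column is a separate matter: by default caret reports the squared correlation between predictions and observations, whereas MLmetrics::R2_Score computes 1 - SS_res/SS_tot, so those two can differ even within a single fold.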

answered Sep 28 '22 06:09

missuse

I get different results. This is apparently a random process.

MLmetrics::RMSE(m$pred$pred, m$pred$obs)
[1] 0.5824464
MLmetrics::R2_Score(m$pred$pred, m$pred$obs)
[1] 0.5271595

If you want a random (more accurately, a pseudo-random) process to be reproducible, then use set.seed immediately prior to the call.
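As a small base-R sketch of that point (the seed value 42 and the variable names are arbitrary choices for illustration), resetting the seed right before each call makes the draws repeat exactly:

```r
# Same seed immediately before each call gives identical draws
set.seed(42)
a <- rnorm(5)

set.seed(42)
b <- rnorm(5)

identical(a, b)

# Without resetting the seed, the next draw continues the stream and differs
c_draw <- rnorm(5)
identical(a, c_draw)
```

The same idea applies to the example in the question: the data are generated with rnorm and the folds are assigned at random, so a seed would be needed before the data simulation as well as before caret::train for two runs to match.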

answered Sep 28 '22 06:09

IRTFM
