I have noticed a peculiar effect: the testing-set RMSE becomes lower than the training-set RMSE when I shuffle the data with the sample() function before splitting, using the caret package.
My code performs a common split into training and testing sets:
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
This gives, for example, a testing-set RMSE of 0.651, which is higher than the training-set RMSE of 0.575 - as expected.
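For reference, the two numbers are computed along these lines (a minimal sketch, where model stands for one of the caret::train objects fitted further below):
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
rmse(training.set[[target_label]], predict(model, training.set))  # training-set RMSE
rmse(testing.set[[target_label]],  predict(model, testing.set))   # testing-set RMSE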
Following the recommendation of many sources (e.g. here), the data should be shuffled, so I do this before the above split:
# shuffle data - short version:
set.seed(17)
dataset <- data %>% nrow %>% sample %>% data[.,]
After this shuffle, the testing-set RMSE (0.528) gets lower than the training-set RMSE (0.575)! This finding is consistent across a number of algorithms, including lm, glm, knn, kknn, rf, gbm, svmLinear, svmRadial, etc.
To my knowledge, the default of sample() is replace = FALSE, so there can't be any data leakage into the testing set. The same observation occurs in classification (for accuracy and kappa), although createDataPartition() performs stratification, so any data imbalance should be handled.
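A quick sanity check supports this (sketch):
# sample(n) is a permutation - no row is drawn twice
set.seed(17)
perm <- sample(nrow(dataset))
any(duplicated(perm))  # FALSE
# and the partition itself is disjoint, so no row-level leakage
idx.train <- as.vector(training.index)
idx.test <- setdiff(seq_len(nrow(dataset)), idx.train)
length(intersect(idx.train, idx.test))  # 0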
I don't use any extraordinary configuration, just ordinary cross-validation:
training.configuration <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = CV.REPEATS,
  savePredictions = "final"
  # returnResamp = "all"
)
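To be precise about what "training RMSE" means here: with repeated CV it can be read either as the resampled estimate or recomputed on training.set (a sketch, where model again stands for a fitted caret::train object):
getTrainPerf(model)$TrainRMSE  # cross-validated RMSE of the best tune
sqrt(mean((training.set[[target_label]] - predict(model, training.set))^2))  # apparent RMSE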
What did I miss here?
--
I checked the data distribution and found a potential hint for the described effect.
Training set distribution:
. Freq prop
1 1 124 13.581599
2 2 581 63.636364
3 3 194 21.248631
4 4 14 1.533406
Testing set distribution without shuffle:
. Freq prop
1 1 42 18.502203
2 2 134 59.030837
3 3 45 19.823789
4 4 6 2.643172
Testing set distribution with shuffle:
. Freq prop
1 1 37 16.299559
2 2 139 61.233480
3 3 45 19.823789
4 4 6 2.643172
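These tables come from a small helper along these lines (a sketch; the helper name is illustrative, and the target in this classification example is a 4-level factor, unlike the numeric medv used below):
distribution <- function(x) {
  freq <- table(x)
  data.frame(Freq = as.vector(freq),
             prop = 100 * as.vector(prop.table(freq)),
             row.names = names(freq))
}
distribution(training.set[[target_label]])
distribution(testing.set[[target_label]])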
If we look at the mode (the most frequent value), its proportion in the testing set with shuffle (61.2%) is closer to the training-set proportion (63.6%) than without shuffle (59.0%).
I don't know how to interpret this statistically from the underlying theory - can anybody?
My intuition is that the shuffling makes the stratification of the testing-set distribution (implicitly performed by createDataPartition()) "more stratified" - by that I mean "closer to the training-set distribution". This may cause an effect like data leakage in the opposite direction - into the testing set.
library(caret)
library(tidyverse)
library(magrittr)
library(mlbench)
data(BostonHousing)
seed <- 171
# shuffled <- TRUE
shuffled <- FALSE
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
  select(-all_of(target_label)) %>% names %T>% print
# define ml algorithms to train
algorithm_list <- c(
"lm"
, "glmnet"
, "knn"
, "gbm"
, "rf"
)
# repeated cv
training_configuration <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePredictions = "final"
  # returnResamp = "all"
)
# preprocess by standardization within each k-fold
preprocess_configuration <- c("center", "scale")
# select variables
dataset %<>% select(all_of(c(target_label, features_labels))) %>% na.omit
# dataset subsetting for tibble: [[
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
########################################
# 3.2: Select the target & features
########################################
target <- training.set[[target_label]]
features <- training.set %>% select(all_of(features_labels)) %>% as.data.frame
########################################
# 3.3: Train the models
########################################
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration
)
return(model)
}
) %>%
setNames(algorithm_list)
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
Running this code for both shuffled = FALSE and shuffled = TRUE, evaluating on testing.set, gives:
model RMSE.testing RMSE.testing.shuffled
1 gbm 3.436164 2.355525
2 glmnet 4.516441 3.785895
3 knn 3.175147 3.340218
4 lm 4.501077 3.843405
5 rf 3.366466 2.092024
The effect is reproducible!
The reason you get a different test RMSE is that you have a different test set. You shuffle your data and then recompute training.index with the same seed, but on reordered rows, so there's no reason to believe the test set would be the same each time.
In your original comparison you need to compare the RMSE of the shuffled test data with the RMSE of the shuffled training data, not of the original training data.
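You can check this directly - the two test sets only partially overlap (a sketch using the question's own seeds):
library(caret); library(mlbench); data(BostonHousing)
set.seed(171)
idx.plain <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
test.plain <- rownames(BostonHousing)[-idx.plain]
set.seed(17)
shuffled.data <- BostonHousing[sample(nrow(BostonHousing)), ]
set.seed(171)
idx.shuf <- createDataPartition(shuffled.data$medv, p = 0.8, list = FALSE)
test.shuf <- rownames(shuffled.data)[-idx.shuf]
length(intersect(test.plain, test.shuf)) / length(test.plain)  # well below 1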
Edit: the shuffling is also unnecessary, as createDataPartition() has its own sampling scheme. You can just change the seed if you want a different test/training split.
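For example:
set.seed(171)
idx.a <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
set.seed(172)
idx.b <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
identical(idx.a, idx.b)  # FALSE - a fresh split, no shuffling required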
I completely agree with Jonny Phelps' answer. Based on your code and the caret source code, there is no reason to suspect any kind of data leakage when using createDataPartition() on shuffled data. So the variation in performance must be due to the different train/test splits.
In order to prove this, I checked the performance of the shuffled and non-shuffled workflows using 10 different seeds with 4 algorithms (boxplot produced by the code below):
I omitted lm and replaced gbm with xgboost, simply out of personal preference.
In my opinion, the result suggests there is no performance pattern between shuffled and non-shuffled data. Perhaps only KNN looks suspicious, but that is just one algorithm.
Code:
Create the initial seeds:
set.seed(1)
gr <- expand.grid(sample = sample(1L:1e5L, 10),
shuffled = c(FALSE, TRUE))
Loop over the initial seeds:
apply(gr, 1, function(x){
print(x)
shuffled <- x[2]
set.seed(x[1])
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
  select(-all_of(target_label)) %>% names %T>% print
algorithm_list <- c(
"glmnet",
"knn",
"rf",
"xgbTree"
)
training_configuration <- trainControl(
method = "repeatedcv",
number = 5, #5 folds and 3 reps is plenty
repeats = 3,
savePredictions = "final",
search = "random") #tune hyper parameters
preprocess_configuration = c("center", "scale")
dataset %<>% select(all_of(c(target_label, features_labels))) %>% na.omit
set.seed(x[1])
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
target <- training.set[[target_label]]
features <- training.set %>% select(all_of(features_labels)) %>% as.data.frame
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration,
tuneLength = 100 #get decent hyper parameters
)
return(model)
}
) %>%
setNames(algorithm_list)
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
}) -> perf
do.call(rbind, perf) %>%
mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
ggplot()+
geom_boxplot(aes(x = model, y = RMSE.testing, color = shuffled)) +
theme_bw()
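If you prefer a number over a plot, one option is a paired Wilcoxon test per model across the 10 seeds (a sketch reusing perf from above; the pairing works because each seed appears once per shuffled value, in the same order):
do.call(rbind, perf) %>%
  mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
  group_by(model) %>%
  summarise(p.value = wilcox.test(RMSE.testing[!shuffled],
                                  RMSE.testing[shuffled],
                                  paired = TRUE)$p.value)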