I have noticed a peculiar effect: the testing-set RMSE becomes lower than the training-set RMSE when I shuffle the data with the sample() function before splitting, using the caret package.
My code performs a common split into training and testing sets:
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
This gives, for example, a testing-set RMSE of 0.651, which is higher than the training-set RMSE of 0.575 - as expected.
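For reference, the two numbers are computed along these lines (a minimal sketch, where model stands for one of the caret::train objects fitted further below):
rmse <- function(observed, predicted) sqrt(mean((observed - predicted)^2))
rmse(training.set[[target_label]], predict(model, training.set))  # training-set RMSE
rmse(testing.set[[target_label]],  predict(model, testing.set))   # testing-set RMSE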
Following the recommendation of many sources (e.g. here), the data should be shuffled, so I do this before the above split:
# shuffle data - short version:
set.seed(17)
dataset <- data %>% nrow %>% sample %>% data[.,]
After this shuffle, the testing-set RMSE (0.528) gets lower than the training-set RMSE (0.575)! This finding is consistent across a number of algorithms, including lm, glm, knn, kknn, rf, gbm, svmLinear, svmRadial, etc.
To my knowledge, the default of sample() is replace = FALSE, so there can't be any data leakage into the testing set. The same observation occurs in classification (for accuracy and kappa), although createDataPartition() performs stratification, so any data imbalance should be handled.
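A quick sanity check supports this (sketch):
# sample(n) is a permutation - no row is drawn twice
set.seed(17)
perm <- sample(nrow(dataset))
any(duplicated(perm))  # FALSE
# and the partition itself is disjoint, so no row-level leakage
idx.train <- as.vector(training.index)
idx.test <- setdiff(seq_len(nrow(dataset)), idx.train)
length(intersect(idx.train, idx.test))  # 0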
I don't use any extraordinary configuration, just ordinary cross-validation:
training.configuration <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = CV.REPEATS,
  savePredictions = "final"
  # returnResamp = "all"
)
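To be precise about what "training RMSE" means here: with repeated CV it can be read either as the resampled estimate or recomputed on training.set (a sketch, where model again stands for a fitted caret::train object):
getTrainPerf(model)$TrainRMSE  # cross-validated RMSE of the best tune
sqrt(mean((training.set[[target_label]] - predict(model, training.set))^2))  # apparent RMSE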
What did I miss here?
--
I checked the data distribution and found a potential hint for the described effect.
Training set distribution:
. Freq prop
1 1 124 13.581599
2 2 581 63.636364
3 3 194 21.248631
4 4 14 1.533406
Testing set distribution without shuffle:
. Freq prop
1 1 42 18.502203
2 2 134 59.030837
3 3 45 19.823789
4 4 6 2.643172
Testing set distribution with shuffle:
. Freq prop
1 1 37 16.299559
2 2 139 61.233480
3 3 45 19.823789
4 4 6 2.643172
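These tables come from a small helper along these lines (a sketch; the helper name is illustrative, and the target in this classification example is a 4-level factor, unlike the numeric medv used below):
distribution <- function(x) {
  freq <- table(x)
  data.frame(Freq = as.vector(freq),
             prop = 100 * as.vector(prop.table(freq)),
             row.names = names(freq))
}
distribution(training.set[[target_label]])
distribution(testing.set[[target_label]])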
If we look at the mode (the most frequent value), its proportion in the testing set with shuffle (61.2%) is closer to the training-set proportion (63.6%) than without shuffle (59.0%).
I don't know how to interpret this statistically from the underlying theory - can anybody?
My intuition is that the shuffling makes the stratification of the testing-set distribution (implicitly performed by createDataPartition()) "more stratified" - by that I mean "closer to the training-set distribution". This may cause an effect like data leakage in the opposite direction - into the testing set.
library(caret)
library(tidyverse)
library(magrittr)
library(mlbench)
data(BostonHousing)
seed <- 171
# shuffled <- TRUE
shuffled <- FALSE
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
  select(-all_of(target_label)) %>% names %T>% print
# define ml algorithms to train
algorithm_list <- c(
"lm"
, "glmnet"
, "knn"
, "gbm"
, "rf"
)
# repeated cv
training_configuration <- trainControl(
  method = "repeatedcv",
  number = 10,
  repeats = 10,
  savePredictions = "final"
  # returnResamp = "all"
)
# preprocess by standardization within each k-fold
preprocess_configuration <- c("center", "scale")
# select variables
dataset %<>% select(all_of(c(target_label, features_labels))) %>% na.omit
# dataset subsetting for tibble: [[
set.seed(seed)
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
########################################
# 3.2: Select the target & features
########################################
target <- training.set[[target_label]]
features <- training.set %>% select(all_of(features_labels)) %>% as.data.frame
########################################
# 3.3: Train the models
########################################
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration
)
return(model)
}
) %>%
setNames(algorithm_list)
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
Running this code for both shuffled = FALSE and shuffled = TRUE, evaluating on testing.set, gives:
model RMSE.testing RMSE.testing.shuffled
1 gbm 3.436164 2.355525
2 glmnet 4.516441 3.785895
3 knn 3.175147 3.340218
4 lm 4.501077 3.843405
5 rf 3.366466 2.092024
The effect is reproducible!
The reason you get a different test RMSE is that you have a different test set. You shuffle your data and then recompute training.index with the same seed, but on reordered rows, so there's no reason to believe the test set would be the same each time.
In your original comparison you need to compare the RMSE of the shuffled test data with the RMSE of the shuffled training data, not of the original training data.
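You can check this directly - the two test sets only partially overlap (a sketch using the question's own seeds):
library(caret); library(mlbench); data(BostonHousing)
set.seed(171)
idx.plain <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
test.plain <- rownames(BostonHousing)[-idx.plain]
set.seed(17)
shuffled.data <- BostonHousing[sample(nrow(BostonHousing)), ]
set.seed(171)
idx.shuf <- createDataPartition(shuffled.data$medv, p = 0.8, list = FALSE)
test.shuf <- rownames(shuffled.data)[-idx.shuf]
length(intersect(test.plain, test.shuf)) / length(test.plain)  # well below 1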
Edit: the shuffling is also unnecessary, as createDataPartition() has its own sampling scheme. You can just change the seed if you want a different test/training split.
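For example:
set.seed(171)
idx.a <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
set.seed(172)
idx.b <- createDataPartition(BostonHousing$medv, p = 0.8, list = FALSE)
identical(idx.a, idx.b)  # FALSE - a fresh split, no shuffling required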
I completely agree with Jonny Phelps' answer. Based on your code and the caret source code, there is no reason to suspect any kind of data leakage when using createDataPartition() on shuffled data. So the variation in performance must be due to the different train/test splits.
In order to prove this, I checked the performance of the shuffled and non-shuffled workflows using 10 different seeds with 4 algorithms (boxplot produced by the code below):
I omitted lm and replaced gbm with xgboost, simply out of personal preference.
In my opinion, the result suggests there is no performance pattern between shuffled and non-shuffled data. Perhaps only KNN looks suspicious, but that is just one algorithm.
Code:
Create the initial seeds:
set.seed(1)
gr <- expand.grid(sample = sample(1L:1e5L, 10),
shuffled = c(FALSE, TRUE))
Loop over the initial seeds:
apply(gr, 1, function(x){
print(x)
shuffled <- x[2]
set.seed(x[1])
if (shuffled) {
dataset <- BostonHousing %>% nrow %>% sample %>% BostonHousing[., ]
} else {
dataset <- BostonHousing %>% as_tibble()
}
target_label <- "medv"
features_labels <- dataset %>% select_if(is.numeric) %>%
  select(-all_of(target_label)) %>% names %T>% print
algorithm_list <- c(
"glmnet",
"knn",
"rf",
"xgbTree"
)
training_configuration <- trainControl(
method = "repeatedcv",
number = 5, #5 folds and 3 reps is plenty
repeats = 3,
savePredictions = "final",
search = "random") #tune hyper parameters
preprocess_configuration = c("center", "scale")
dataset %<>% select(all_of(c(target_label, features_labels))) %>% na.omit
set.seed(x[1])
training.index <- createDataPartition(dataset[[target_label]], p = 0.8, list = FALSE)
training.set <- dataset[training.index, ]
testing.set <- dataset[-training.index, ]
target <- training.set[[target_label]]
features <- training.set %>% select(all_of(features_labels)) %>% as.data.frame
models.list <- list()
models.list <- algorithm_list %>%
map(function(algorithm_label) {
model <- train(
x = features,
y = target,
method = algorithm_label,
preProcess = preprocess_configuration,
trControl = training_configuration,
tuneLength = 100 #get decent hyper parameters
)
return(model)
}
) %>%
setNames(algorithm_list)
observed <- testing.set[[target_label]]
models.list %>%
predict(testing.set) %>%
map_df(function(predicted) {
sqrt(mean((observed - predicted)^2))
}) %>%
t %>% as_tibble(rownames = "model") %>%
rename(RMSE.testing = V1) %>%
arrange(RMSE.testing) %>%
as.data.frame
}) -> perf
do.call(rbind, perf) %>%
mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
ggplot()+
geom_boxplot(aes(x = model, y = RMSE.testing, color = shuffled)) +
theme_bw()
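If you prefer a number over a plot, one option is a paired Wilcoxon test per model across the 10 seeds (a sketch reusing perf from above; the pairing works because each seed appears once per shuffled value, in the same order):
do.call(rbind, perf) %>%
  mutate(shuffled = rep(c(FALSE, TRUE), each = 40)) %>%
  group_by(model) %>%
  summarise(p.value = wilcox.test(RMSE.testing[!shuffled],
                                  RMSE.testing[shuffled],
                                  paired = TRUE)$p.value)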