Train time series models in caret by group

Tags: r, time-series, training-data, r-caret

I have a data set like the following:

library(data.table)
library(caret)

set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
                  y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE))

# Time index within each group
foo[, period := 1:.N, by = group]

Problem: I want to forecast y one step ahead, for each group, using variables x1, ..., x5

I want to run a few models in caret to decide which I will use.

As of now, I am running it in a loop using timeslice:

window.length <- 115
timecontrol   <- trainControl(method          = 'timeslice',
                            initialWindow     = window.length,
                            horizon           = 1, 
                            selectionFunction = "best",
                            fixedWindow       = TRUE, 
                            savePredictions   = 'final')

model_list <- list()
for (g in unique(foo$group)) {
  # Subset once per group; group and period are not used as predictors
  dat <- foo[group == g][, c('group', 'period') := NULL]
  for (model in c("xgbTree", "earth", "cubist")) {
    model_list[[g]][[model]] <- train(y ~ . - 1,
                                      data = dat,
                                      method = model,
                                      trControl = timecontrol)
  }
}

However, I would like to run all groups at the same time, using dummy variables to identify each one, like this:

dat <- cbind(foo,  model.matrix(~ group- 1, foo))
            y         x1       x2       x3            x4 x5 period groupA groupB groupC groupD groupE groupF
  1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04  1      1      1      0      0      0      0      0
  2: 3.442213  8.6558983 32.41881 45.70801  3.255423e-01  1      1      0      1      0      0      0      0
  3: 3.485286  7.7295448 21.99022 56.42133  8.668391e+00  1      1      0      0      1      0      0      0
  4: 9.659601  0.9166456 30.34609 55.72661 -7.666063e+00  1      1      0      0      0      1      0      0
  5: 5.567950  3.0306864 22.07813 52.21099  5.377153e-01  1      1      0      0      0      0      1      0

I still want the resampling to respect the time ordering, using timeslice.

Is there a way to declare the time variable in trainControl, so that each one-step-ahead round adds six more observations (one per group) to the window and drops the first six?

I can do it by ordering the data by the time variable and adjusting the horizon argument (with n groups, set horizon = n), but this has to change whenever the number of groups changes. initialWindow also has to be the number of periods times the number of groups:

timecontrol   <- trainControl(method          = 'timeslice',
                            initialWindow     = window.length * length(unique(foo$group)),
                            horizon           = length(unique(foo$group)), 
                            selectionFunction = "best",
                            fixedWindow       = TRUE, 
                            savePredictions   = 'final')

Is there any other way?


asked Apr 09 '19 13:04

Felipe Alvarenga

1 Answer

I think the answer you are looking for is actually quite simple. You can use the skip argument to trainControl() to thin out the resamples: the window advances by skip + 1 rows between consecutive train/test sets. With the rows sorted by period and skip chosen so that the window advances by exactly one period, each group-period is predicted only once, a period is never split between the training set and the test set, and there is no information leakage.

Using the example you provided, set horizon = 6 (the number of groups), skip = 5 (the number of groups minus one, so that the window moves 6 rows at a time), and initialWindow = 115 * 6 = 690. The first test set will then include all groups for period 116, the next test set all groups for period 117, and so on.
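One way to check the alignment before fitting anything is to build the slices directly with caret's createTimeSlices(), which takes the same initialWindow, horizon, fixedWindow and skip arguments. The sketch below assumes the rows are sorted by period and then by group; the helper `periods_per_slice()` and the `period` vector are illustrative names, not part of caret.

```r
library(caret)

n_groups      <- 6
n_periods     <- 150
window_length <- 115

# Period label of each row, assuming rows are sorted by period, then group
period <- rep(seq_len(n_periods), each = n_groups)

# Hypothetical helper: number of distinct periods in each test slice
periods_per_slice <- function(skip) {
  slices <- createTimeSlices(
    y             = seq_along(period),
    initialWindow = window_length * n_groups,
    horizon       = n_groups,
    fixedWindow   = TRUE,
    skip          = skip
  )
  vapply(slices$test, function(i) length(unique(period[i])), integer(1))
}

# The window advances by skip + 1 rows, so skip = n_groups - 1 moves it
# one full period at a time and every test slice holds a single period
aligned <- periods_per_slice(skip = n_groups - 1)
table(aligned)

# With skip = n_groups the window moves 7 rows at a time and most test
# slices straddle two periods
misaligned <- periods_per_slice(skip = n_groups)
table(misaligned)
```

If every entry of `aligned` is 1, each test set corresponds to exactly one period across all groups. The same check can be rerun whenever the number of groups changes.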

library(caret)
library(tidyverse)

set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
                  y  = rnorm(n = 6 * 150, mean = 5, sd = 2),
                  x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
                  x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
                  x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
                  x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
                  x5 = sample(c(1, 0), size = 6 * 150, replace = TRUE)) %>% 
  group_by(group) %>% 
  mutate(period = row_number()) %>% 
  ungroup() 

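n_groups <- n_distinct(foo$group)

# timeslice resampling works on row positions, so the rows must be sorted
# by period (and by group within each period) before the slices are built
foo <- arrange(foo, period, group)

dat <- cbind(foo, model.matrix(~ group - 1, foo)) %>% 
  select(-group)

window.length <- 115

timecontrol   <- trainControl(
  method            = 'timeslice',
  initialWindow     = window.length * n_groups,  # 115 full periods
  horizon           = n_groups,                  # one period per test set
  skip              = n_groups - 1,              # window advances by skip + 1 rows
  selectionFunction = "best",
  fixedWindow       = TRUE,
  savePredictions   = 'final'
)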

model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
            ~ train(
              y ~ . - 1,
              data = dat,
              method = .x,
              trControl = timecontrol
            )) %>% 
  set_names(model_names)
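Since the goal is to choose a model for each group, the saved hold-out predictions can be split back out by group after fitting. The following is a minimal sketch under a few assumptions: `lm` is used only as a fast stand-in for xgbTree, earth or cubist, the simulated data has just two predictors, and the rows are already sorted by period and then group. It relies on the `rowIndex` column that caret stores in `fit$pred` when savePredictions is set.

```r
library(caret)

set.seed(503)
n_groups  <- 6
n_periods <- 150
n         <- n_groups * n_periods

# Simulated panel, already sorted by period and then by group
foo <- data.frame(
  group  = factor(rep(LETTERS[1:n_groups], n_periods)),
  period = rep(seq_len(n_periods), each = n_groups),
  y      = rnorm(n, mean = 5, sd = 2),
  x1     = rnorm(n, mean = 5, sd = 10),
  x2     = rnorm(n, mean = 25, sd = 10)
)

timecontrol <- trainControl(
  method          = "timeslice",
  initialWindow   = 115 * n_groups,
  horizon         = n_groups,
  skip            = n_groups - 1,   # window advances by skip + 1 rows
  fixedWindow     = TRUE,
  savePredictions = "final"
)

# "lm" is a stand-in here; any caret method name could be substituted
fit <- train(y ~ x1 + x2 + group, data = foo,
             method = "lm", trControl = timecontrol)

# rowIndex points back at the rows of `foo`, which recovers the group label
preds       <- fit$pred
preds$group <- foo$group[preds$rowIndex]

# Out-of-sample RMSE for each group
rmse_by_group <- sapply(
  split(preds, preds$group),
  function(d) sqrt(mean((d$obs - d$pred)^2))
)
rmse_by_group
```

Repeating this for each fitted model gives one RMSE per model and group, which can then be compared group by group rather than relying only on the pooled resampling metric.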


answered Oct 05 '22 17:10

Giovanni Colitti
