I have a data set like the following
set.seed(503)
foo <- data.table(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = T))
foo[, period := 1:.N, by = group]
Problem: I want to forecast y
one step ahead, for each group
, using variables x1, ..., x5
I want to run a few models in caret
to decide which I will use.
As of now, I am running it in a loop using timeslice
window.length <- 115
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length,
horizon = 1,
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
model_list <- list()
for(g in unique(foo$group)){
for(model in c("xgbTree", "earth", "cubist")){
dat <- foo[group == g][, c('group', 'period') := NULL]
model_list[[g]][[model]] <- train(y ~ . - 1,
data = dat,
method = model,
trControl = timecontrol)
}
}
However, I would like to run all groups at the same time, using dummy variables to identify each one, like
dat <- cbind(foo, model.matrix(~ group- 1, foo))
y x1 x2 x3 x4 x5 period groupA groupB groupC groupD groupE groupF
1: 5.710250 11.9615460 22.62916 31.04790 -4.821331e-04 1 1 1 0 0 0 0 0
2: 3.442213 8.6558983 32.41881 45.70801 3.255423e-01 1 1 0 1 0 0 0 0
3: 3.485286 7.7295448 21.99022 56.42133 8.668391e+00 1 1 0 0 1 0 0 0
4: 9.659601 0.9166456 30.34609 55.72661 -7.666063e+00 1 1 0 0 0 1 0 0
5: 5.567950 3.0306864 22.07813 52.21099 5.377153e-01 1 1 0 0 0 0 1 0
But still running the time series with the correct time ordering using timeslice
.
Is there a way to declare the time
variable in trainControl
, so my one step ahead
forecast uses, in this case, six more observations for each round and droping the first 6 observations?
I can do it by ordering the data and messing with the horizon
argument (given n
groups, order by the time variable and put horizon = n
), but this has to change if the number of groups change. And initial.window
will have to be time * n_groups
timecontrol <- trainControl(method = 'timeslice',
initialWindow = window.length * length(unique(foo$group)),
horizon = length(unique(foo$group)),
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final')
Is there any ohter way?
6 Available Models | The caret Package.
The train function can generate a candidate set of parameter values and the tuneLength argument controls how many are evaluated. In the case of PLS, the function uses a sequence of integers from 1 to tuneLength . If we want to evaluate all integers between 1 and 15, setting tuneLength = 15 would achieve this.
# The tuneGrid parameter lets us decide which values the main parameter will take # While tuneLength only limit the number of default parameters to use.
tuneLength = It allows system to tune algorithm automatically. It indicates the number of different values to try for each tunning parameter. For example, mtry for randomForest. Suppose, tuneLength = 5, it means try 5 different mtry values and find the optimal mtry value based on these 5 values.
I think the answer you are looking for is actually quite simple. You can use the skip
argument to trainControl()
to skip the desired number of observations after each train/test set. In this way, you only predict each group-period once, the same period is never split between the training group and testing group, and there is no information leakage.
Using the example you provided, if you set skip = 6
and horizon = 6
(the number of groups), and initialWindow = 115
, then the first test set will include all groups for period 116, the next test set will include all groups for period 117, and so on.
library(caret)
library(tidyverse)
set.seed(503)
foo <- tibble(group = rep(LETTERS[1:6], 150),
y = rnorm(n = 6 * 150, mean = 5, sd = 2),
x1 = rnorm(n = 6 * 150, mean = 5, sd = 10),
x2 = rnorm(n = 6 * 150, mean = 25, sd = 10),
x3 = rnorm(n = 6 * 150, mean = 50, sd = 10),
x4 = rnorm(n = 6 * 150, mean = 0.5, sd = 10),
x5 = sample(c(1, 0), size = 6 * 150, replace = T)) %>%
group_by(group) %>%
mutate(period = row_number()) %>%
ungroup()
dat <- cbind(foo, model.matrix(~ group- 1, foo)) %>%
select(-group)
window.length <- 115
timecontrol <- trainControl(
method = 'timeslice',
initialWindow = window.length * length(unique(foo$group)),
horizon = length(unique(foo$group)),
skip = length(unique(foo$group)),
selectionFunction = "best",
fixedWindow = TRUE,
savePredictions = 'final'
)
model_names <- c("xgbTree", "earth", "cubist")
fits <- map(model_names,
~ train(
y ~ . - 1,
data = dat,
method = .x,
trControl = timecontrol
)) %>%
set_names(model_names)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With