The do-function in dplyr lets you make a lot of cool models fast and easy, but I am struggeling to use these models for good rolling forecasts.
# Data illustration
require(dplyr)
require(forecast)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
My factor variable is Hour
, my exogenous variables are Wind
and temp
, and the thing I want to forecast is Price
. So, basically, I have 24 models I would like to be able to do rolling forecasts with.
Now, my data frame contains 180 days. I would like to go back 100 days, and do a 1 day rolling forecast and then be able to compare this to the actual Price
.
Doing this brute force would look something like this:
# First I fit the data frame to be exactly the right length
# 100 days to start with (2015-03-21 or so), then 99, then 98.., etc.
n <- 100 * 24
# Make the price <- NA so I can replace it with a forecast
df$Price[(nrow(df) - n): (nrow(df) - n + 24)] <- NA
# Now I make df just 81 days long, the estimation period + the first forecast
df <- df[1 : (nrow(df) - n + 24), ]
# The actual do & fit, later termed fx(df)
result <- df %>% group_by(Hour) %>% do ({
historical <- .[!is.na(.$Price), ]
forecasted <- .[is.na(.$Price), c("Date", "Hour", "Wind", "Temp")]
fit <- Arima(historical$Price, xreg = historical[, 3:4], order = c(1, 1, 0))
data.frame(forecasted[],
Price = forecast.Arima(fit, xreg = forecasted[3:4])$mean )
})
result
Now I would change n
to 99 * 24. But it would be awesome to have this in a loop or apply, but I simply can't figure out how to do it, and also save each new forecast.
I've tried a loop like this, but no luck yet:
# 100 days ago, forecast that day, then the next, etc.
for (n in 1:100) {
nx <- n * 24 * 80 # Because I want to start after 80 days
df[nx:(nx + 23), 5] <- NA # Set prices to NA so I can forecast them
fx(df) # do the function
df.results[n] <- # Write the results into a vector / data frame to save them
# and now rinse and repeat for n + 1
}
Truly awesome bonus-points for a broom
-like solution :)
I'll start off by noting that there is an error present in your for loop. Instead of n*24*80
you probably meant (n+80)*24
. The counter in your loop should also go from 0 to 99 instead of 1 to 100 if you want to include the prediction for the 81st day as well.
I'll try to provide an elegant solution for your problem below. First, we define our test dataframe in the exact same way that you have done in your post:
set.seed(2)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"),
to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1,
Wind = runif(4320, min = 1, max = 5000),
Temp = runif(4320, min = - 20, max = 25),
Price = runif(4320, min = -15, max = 45)
)
Next, we define a function that performs the prediction for one particular day. Input arguments are the dataframe under consideration and the minimal number of trainingdays that should be in the trainingset (=80 in this example). minTrainingDays+offSet+1
represents the actual day that we are predicting. Note that we start counting from 0 for the offset.
forecastOneDay <- function(theData,minTrainingDays,offset)
{
nrTrainingRows <- (minTrainingDays+offset)*24
theForecast <- theData %>%
filter(min_rank(Date) <= nrTrainingRows+24) %>% # Drop future data that we don't need
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
We want to predict days 81-180. In other words, we need a minimum of 80 days in our training set and want to compute the function results for offsets 0:99
. This can be accomplished with a simple lapply
call. We end by merging all the results together in a dataframe:
# Perform one day forecasts for days 81-180
resultList <- lapply(0:99, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
EDIT After reviewing your post and another answer that was posted in the meantime I noticed two potential problems with my answer. First, you wanted a rolling window of 80 days training data. However, In my previous code all the available training data is being used to fit the model instead of going back only 80 days. Second, the code is not robust to DST changes.
These two issues are fixed in the code below. The inputs of the function are also more intuitive now: The number of trainingdays and the actual predicted day can be used as inputarguments. Note that the POSIXlt
data format handles things like DST, leap years etc correctly when performing operations on the dates. Because the dates in your dataframe are of type POSIXct
we need to do a small type conversion back and forth in order to handle things properly.
New code below:
forecastOneDay <- function(theData,nrTrainingDays,predictDay) # predictDay should be greater than nrTrainingDays
{
initialDate <- as.POSIXlt(theData$Date[1]); # First day (midnight hour)
startDate <- initialDate # Beginning of training interval
endDate <- initialDate # End of test interval
startDate$mday <- initialDate$mday + (predictDay-nrTrainingDays-1) # Go back 80 days from predictday
endDate$mday <- startDate$mday + (nrTrainingDays+1) # +1 to include prediction day
theForecast <- theData %>%
filter(Date >= as.POSIXct(startDate),Date < as.POSIXct(endDate)) %>%
group_by(Hour) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
}
# Perform one day forecasts for days 81-180
resultList <- lapply(81:180, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)
Results look like this:
> head(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-03-22 00:00:00 1 1691.589 -8.722152 -11.207139 5.918541
2 2015-03-22 01:00:00 2 1790.928 18.098358 3.902686 37.885532
3 2015-03-22 02:00:00 3 1457.195 10.166422 22.193270 34.984164
4 2015-03-22 03:00:00 4 1414.502 4.993783 6.370435 12.037642
5 2015-03-22 04:00:00 5 3020.755 9.540715 25.440357 -1.030102
6 2015-03-22 05:00:00 6 4102.651 2.446729 33.528199 39.607848
> tail(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour
Date Hour Wind Temp realPrice predictedPrice
1 2015-06-29 18:00:00 19 1521.9609 13.6414797 12.884175 -6.7789109
2 2015-06-29 19:00:00 20 555.1534 3.4758159 37.958768 -5.1193514
3 2015-06-29 20:00:00 21 4337.6605 4.7242352 -9.244882 33.6817379
4 2015-06-29 21:00:00 22 3140.1531 0.8127839 15.825230 -0.4625457
5 2015-06-29 22:00:00 23 1389.0330 20.4667234 -14.802268 15.6755880
6 2015-06-29 23:00:00 24 763.0704 9.1646139 23.407525 3.8214642
One can potentially create a "rolling" data.frame with dplyr as follows
library(dplyr)
library(lubridate)
WINDOW_SIZE_DAYS <- 80
df2 <- df %>%
mutate(Day = yday(Date)) %>%
replicate( n = WINDOW_SIZE_DAYS, simplify = FALSE ) %>%
bind_rows %>%
group_by(Date) %>%
mutate(Replica_Num = 1:n() ) %>%
mutate(Day_Group_id = Day + Replica_Num - 1 ) %>%
ungroup() %>%
group_by(Day_Group_id) %>%
filter( n() >= 24*WINDOW_SIZE_DAYS - 1 ) %>%
select( -Replica_Num ) %>%
arrange(Date) %>%
ungroup()
Basically, this code replicates observations as needed and assigns a corresponding Day_Group_id
to each 80 day chunk.
What this does is allows one to use group_by(Day_Group_id)
in order to run the model separately on each 80 day chunk.
Subsequently, one can use it as desired. For example, just copy/pasting the arima code from above as follows:
df3 <- df2 %>%
group_by(Day_Group_id, Hour) %>%
arrange(Date) %>%
do ({
trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
})
Please Note:
The filter(n() >= 24*WINDOW_SIZE_DAYS - 1)
is used here instead of filter(n() == 24*WINDOW_SIZE_DAYS)
in order to select full 80-day windows. This is due to Daylight saving time adjustment on 2015-03-08
. The hour 2015-03-08 02:00:00
does not exist in the dataset as it jumps from 2015-03-08 01:00:00
straight to 2015-03-08 03:00:00
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With