Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Multi-steps forecasting with dplyr and do

The do-function in dplyr lets you make a lot of cool models fast and easy, but I am struggeling to use these models for good rolling forecasts.

# Data illustration

require(dplyr)
require(forecast)

df <- data.frame(
  Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"), 
                    to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))

  df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1, 
                      Wind = runif(4320, min = 1, max = 5000), 
                      Temp = runif(4320, min = - 20, max = 25), 
                      Price = runif(4320, min = -15, max = 45)
                      )

My factor variable is Hour, my exogenous variables are Wind and temp, and the thing I want to forecast is Price. So, basically, I have 24 models I would like to be able to do rolling forecasts with.

Now, my data frame contains 180 days. I would like to go back 100 days, and do a 1 day rolling forecast and then be able to compare this to the actual Price.

Doing this brute force would look something like this:

# First I fit the data frame to be exactly the right length
# 100 days to start with (2015-03-21 or so), then 99, then 98.., etc. 
n <- 100 * 24

# Make the price <- NA so I can replace it with a forecast
df$Price[(nrow(df) - n): (nrow(df) - n + 24)] <- NA

# Now I make df just 81 days long, the estimation period + the first forecast
df <- df[1 : (nrow(df) - n + 24), ]

# The actual do & fit, later termed fx(df)

result <- df %>% group_by(Hour) %>% do ({
  historical <- .[!is.na(.$Price), ]
  forecasted <- .[is.na(.$Price), c("Date", "Hour", "Wind", "Temp")]
  fit <- Arima(historical$Price, xreg = historical[, 3:4], order = c(1, 1, 0))
  data.frame(forecasted[], 
             Price = forecast.Arima(fit, xreg = forecasted[3:4])$mean )
})

result

Now I would change n to 99 * 24. But it would be awesome to have this in a loop or apply, but I simply can't figure out how to do it, and also save each new forecast.

I've tried a loop like this, but no luck yet:

# 100 days ago, forecast that day, then the next, etc.
for (n in 1:100) { 
  nx <- n * 24 * 80         # Because I want to start after 80 days
  df[nx:(nx + 23), 5] <- NA # Set prices to NA so I can forecast them
  fx(df) # do the function
  df.results[n] <- # Write the results into a vector / data frame to save them
    # and now rinse and repeat for n + 1
  }

Truly awesome bonus-points for a broom-like solution :)

like image 805
Thorst Avatar asked Jun 30 '15 12:06

Thorst


2 Answers

I'll start off by noting that there is an error present in your for loop. Instead of n*24*80 you probably meant (n+80)*24. The counter in your loop should also go from 0 to 99 instead of 1 to 100 if you want to include the prediction for the 81st day as well.

I'll try to provide an elegant solution for your problem below. First, we define our test dataframe in the exact same way that you have done in your post:

set.seed(2)
df <- data.frame(
Date = seq.POSIXt(from = as.POSIXct("2015-01-01 00:00:00"), 
                    to = as.POSIXct("2015-06-30 00:00:00"), by = "hour"))
df <- df %>% mutate(Hour = as.numeric(format(Date, "%H")) + 1, 
                    Wind = runif(4320, min = 1, max = 5000), 
                    Temp = runif(4320, min = - 20, max = 25), 
                    Price = runif(4320, min = -15, max = 45)
)

Next, we define a function that performs the prediction for one particular day. Input arguments are the dataframe under consideration and the minimal number of trainingdays that should be in the trainingset (=80 in this example). minTrainingDays+offSet+1 represents the actual day that we are predicting. Note that we start counting from 0 for the offset.

forecastOneDay <- function(theData,minTrainingDays,offset)
{
  nrTrainingRows <- (minTrainingDays+offset)*24

  theForecast <- theData %>% 
    filter(min_rank(Date) <= nrTrainingRows+24) %>% # Drop future data that we don't need
    group_by(Hour) %>%
    do ({
      trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
      forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
      fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
      data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
    })
}

We want to predict days 81-180. In other words, we need a minimum of 80 days in our training set and want to compute the function results for offsets 0:99. This can be accomplished with a simple lapply call. We end by merging all the results together in a dataframe:

# Perform one day forecasts for days 81-180
resultList <- lapply(0:99, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)

EDIT After reviewing your post and another answer that was posted in the meantime I noticed two potential problems with my answer. First, you wanted a rolling window of 80 days training data. However, In my previous code all the available training data is being used to fit the model instead of going back only 80 days. Second, the code is not robust to DST changes.

These two issues are fixed in the code below. The inputs of the function are also more intuitive now: The number of trainingdays and the actual predicted day can be used as inputarguments. Note that the POSIXlt data format handles things like DST, leap years etc correctly when performing operations on the dates. Because the dates in your dataframe are of type POSIXct we need to do a small type conversion back and forth in order to handle things properly.

New code below:

forecastOneDay <- function(theData,nrTrainingDays,predictDay) # predictDay should be greater than nrTrainingDays
{
  initialDate <- as.POSIXlt(theData$Date[1]); # First day (midnight hour)
  startDate <- initialDate # Beginning of training interval
  endDate <- initialDate # End of test interval

  startDate$mday <- initialDate$mday + (predictDay-nrTrainingDays-1) # Go back 80 days from predictday
  endDate$mday <- startDate$mday + (nrTrainingDays+1) # +1 to include prediction day

  theForecast <- theData %>% 
    filter(Date >= as.POSIXct(startDate),Date < as.POSIXct(endDate)) %>% 
    group_by(Hour) %>%
    do ({
      trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
      forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
      fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
      data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
    })
}

# Perform one day forecasts for days 81-180
resultList <- lapply(81:180, function(x) forecastOneDay(df,80,x))
# Merge all the results
mergedForecasts <- do.call("rbind",resultList)

Results look like this:

> head(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour

                 Date Hour     Wind      Temp  realPrice predictedPrice
1 2015-03-22 00:00:00    1 1691.589 -8.722152 -11.207139       5.918541
2 2015-03-22 01:00:00    2 1790.928 18.098358   3.902686      37.885532
3 2015-03-22 02:00:00    3 1457.195 10.166422  22.193270      34.984164
4 2015-03-22 03:00:00    4 1414.502  4.993783   6.370435      12.037642
5 2015-03-22 04:00:00    5 3020.755  9.540715  25.440357      -1.030102
6 2015-03-22 05:00:00    6 4102.651  2.446729  33.528199      39.607848
> tail(mergedForecasts)
Source: local data frame [6 x 6]
Groups: Hour

                 Date Hour      Wind       Temp  realPrice predictedPrice
1 2015-06-29 18:00:00   19 1521.9609 13.6414797  12.884175     -6.7789109
2 2015-06-29 19:00:00   20  555.1534  3.4758159  37.958768     -5.1193514
3 2015-06-29 20:00:00   21 4337.6605  4.7242352  -9.244882     33.6817379
4 2015-06-29 21:00:00   22 3140.1531  0.8127839  15.825230     -0.4625457
5 2015-06-29 22:00:00   23 1389.0330 20.4667234 -14.802268     15.6755880
6 2015-06-29 23:00:00   24  763.0704  9.1646139  23.407525      3.8214642
like image 173
Jellen Vermeir Avatar answered Oct 10 '22 17:10

Jellen Vermeir


One can potentially create a "rolling" data.frame with dplyr as follows

library(dplyr)
library(lubridate)

WINDOW_SIZE_DAYS <- 80

df2 <- df %>%
  mutate(Day = yday(Date)) %>%
  replicate( n = WINDOW_SIZE_DAYS, simplify = FALSE ) %>% 
  bind_rows %>%
  group_by(Date) %>%
  mutate(Replica_Num = 1:n() ) %>%
  mutate(Day_Group_id = Day + Replica_Num - 1 ) %>%
  ungroup() %>%
  group_by(Day_Group_id) %>%
  filter( n() >= 24*WINDOW_SIZE_DAYS - 1 ) %>%
  select( -Replica_Num ) %>%
  arrange(Date) %>%
  ungroup()

Basically, this code replicates observations as needed and assigns a corresponding Day_Group_id to each 80 day chunk. What this does is allows one to use group_by(Day_Group_id) in order to run the model separately on each 80 day chunk.

Subsequently, one can use it as desired. For example, just copy/pasting the arima code from above as follows:

df3 <- df2 %>%
  group_by(Day_Group_id, Hour) %>%
  arrange(Date) %>%
  do ({
    trainingData <- head(.,-1) # For each group, drop the last entry from the dataframe
    forecastData <- tail(.,1) %>% select(Date,Hour,Wind,Temp) # For each group, predict the last entry
    fit <- Arima(trainingData$Price, xreg=trainingData[,3:4], order=c(1,1,0))
    data.frame(forecastData, realPrice = tail(.,1)$Price, predictedPrice = forecast.Arima(fit,xreg=forecastData[3:4])$mean)
  })

Please Note:

The filter(n() >= 24*WINDOW_SIZE_DAYS - 1) is used here instead of filter(n() == 24*WINDOW_SIZE_DAYS) in order to select full 80-day windows. This is due to Daylight saving time adjustment on 2015-03-08. The hour 2015-03-08 02:00:00 does not exist in the dataset as it jumps from 2015-03-08 01:00:00 straight to 2015-03-08 03:00:00.

like image 44
akhmed Avatar answered Oct 10 '22 17:10

akhmed