I'm using lm on a time series, which works quite well actually, and it's super super fast.
Let's say my model is:
> formula <- y ~ x
I train this on a training set:
> train <- data.frame( x = seq(1,3), y = c(2,1,4) ) > model <- lm( formula, train )
... and I can make predictions for new data:
> test <- data.frame( x = seq(4,6) ) > test$y <- predict( model, newdata = test ) > test x y 1 4 4.333333 2 5 5.333333 3 6 6.333333
This works super nicely, and it's really speedy.
I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:
> train$y_1 <- c(0,train$y[1:nrow(train)-1]) > train x y y_1 1 1 2 0 2 2 1 2 3 3 4 1
update the formula:
formula <- y ~ x * y_1
... and training will work just fine:
> model <- lm( formula, train ) > # no errors here
However, the problem is that there is no way of using 'predict', because there is no way of populating y_1 in a test set in a batch manner.
Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2)
and so on, and these work directly using the unmodified training and test data.
So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict
can be used? Ideally:
formula <- y ~ x * lag(y,-1) model <- lm( formula, train ) test$y <- predict( model, newdata = test )
... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict
directly?
Lagged dependent variables (LDVs) have been used in regression analysis to provide robust estimates of the effects of independent variables, but some research argues that using LDVs in regressions produces negatively biased coefficient estimates, even if the LDV is part of the data-generating process.
When the current value of your dependant variable depends on the past value(s), you add it as an explanatory variable, e.g. how much dividend was given in the previous year affects the dividend decision of the current year. That's why we add its lagged values in the model.
A dependent variable that is lagged in time. For example, if Yt is the dependent variable, then Yt-1 will be a lagged dependent variable with a lag of one period. Lagged values are used in Dynamic Regression modeling.
Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.
Here is the beginning of its examples -- a one and twelve month lag:
R> data("UKDriverDeaths", package = "datasets") R> uk <- log10(UKDriverDeaths) R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12)) R> dfm Time series regression with "ts" data: Start = 1970(1), End = 1984(12) Call: dynlm(formula = uk ~ L(uk, 1) + L(uk, 12)) Coefficients: (Intercept) L(uk, 1) L(uk, 12) 0.183 0.431 0.511 R>
Following Dirk's suggestion on dynlm
, I couldn't quite figure out how to predict, but searching for that led me to dyn
package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package
Then after several hours of experimentation I came up with the following function to handle the prediction. There were quite a few 'gotcha's on the way, eg you can't seem to rbind
time series, and the result of predict is offset by start
and a whole bunch of things like that, so I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.
So, a solution that works is:
dyn
packagepredictDyn method:
# pass in training data, test data, # it will step through one by one # need to give dependent var name, so that it can make this into a timeseries predictDyn <- function( model, train, test, dependentvarname ) { Ntrain <- nrow(train) Ntest <- nrow(test) # can't rbind ts's apparently, so convert to numeric first train[,dependentvarname] <- as.numeric(train[,dependentvarname]) test[,dependentvarname] <- as.numeric(test[,dependentvarname]) testtraindata <- rbind( train, test ) testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) ) for( i in 1:Ntest ) { result <- predict(model,newdata=testtraindata,subset=1:(Ntrain+i-1)) testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1] } return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] ) }
Example usage:
library("dyn") # size of training and test data N <- 6 predictN <- 10 # create training data, which we can get exact fit on, so we can check the results easily traindata <- c(1,2) for( i in 3:N ) { traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1] } train <- data.frame( y = ts( traindata ), foo = 1) # create testing data, bunch of NAs test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1) # fit a model model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train ) # look at the model, it's a perfect fit. Nice! print(model) test <- predictDyn( model, train, test, "y" ) print(test) # nice plot plot(test$y, type='l')
Output:
> model Call: lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train) Coefficients: (Intercept) lag(y, -1) lag(y, -2) 0.5 1.7 1.3 > test y foo 7 143.2054 1 8 325.6810 1 9 740.3247 1 10 1682.4373 1 11 3823.0656 1 12 8686.8801 1 13 19738.1816 1 14 44848.3528 1 15 101902.3358 1 16 231537.3296 1
Edit: hmmm, this is super slow though. Even if I limit the data in the subset
to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60
= 1.792 hours
:-O
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With