Adding lagged variables to an lm model?

I'm using lm on a time series, which works quite well actually, and it's super fast.

Let's say my model is:

> formula <- y ~ x 

I train this on a training set:

> train <- data.frame( x = seq(1,3), y = c(2,1,4) )
> model <- lm( formula, train )

... and I can make predictions for new data:

> test <- data.frame( x = seq(4,6) )
> test$y <- predict( model, newdata = test )
> test
  x        y
1 4 4.333333
2 5 5.333333
3 6 6.333333

This works super nicely, and it's really speedy.

I want to add lagged variables to the model. Now, I could do this by augmenting my original training set:

> train$y_1 <- c(0,train$y[1:nrow(train)-1])
> train
  x y y_1
1 1 2   0
2 2 1   2
3 3 4   1

update the formula:

formula <- y ~ x * y_1 

... and training will work just fine:

> model <- lm( formula, train )
> # no errors here

However, the problem is that there is no way to use predict, because there is no way to populate y_1 in the test set in a batch manner: each test row's y_1 is the previous row's prediction, which doesn't exist yet.
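One workaround (a sketch, not from the original question, with made-up training data) is to predict one step at a time, feeding each prediction back in as the next row's y_1:

```r
# Sketch: iterative one-step-ahead prediction with plain lm.
set.seed(1)
n <- 10
x <- 1:n
y <- numeric(n)
y[1] <- 2
for (i in 2:n) y[i] <- 0.5 * y[i - 1] + x[i] + rnorm(1)
train <- data.frame(x = x, y = y)
train$y_1 <- c(0, head(train$y, -1))          # lag-1 of y, padded with 0
model <- lm(y ~ x * y_1, data = train)

test <- data.frame(x = (n + 1):(n + 3))
prev_y <- tail(train$y, 1)                     # last observed y seeds the first lag
preds <- numeric(nrow(test))
for (i in seq_len(nrow(test))) {
  newrow <- data.frame(x = test$x[i], y_1 = prev_y)
  preds[i] <- predict(model, newdata = newrow)
  prev_y <- preds[i]                           # feed the prediction back in
}
test$y <- preds
```

This works, but it calls predict once per row instead of once per batch, which is exactly the overhead the question is trying to avoid.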

Now, for lots of other regression things, there are very convenient ways to express them in the formula, such as poly(x,2) and so on, and these work directly using the unmodified training and test data.

So, I'm wondering if there is some way of expressing lagged variables in the formula, so that predict can be used? Ideally:

formula <- y ~ x * lag(y,-1)
model <- lm( formula, train )
test$y <- predict( model, newdata = test )

... without having to augment (not sure if that's the right word) the training and test datasets, and just being able to use predict directly?

Hugh Perkins, asked Oct 27 '12


People also ask

Why would we want to use lagged dependent variables in a regression specification?

Lagged dependent variables (LDVs) have been used in regression analysis to provide robust estimates of the effects of independent variables, but some research argues that using LDVs in regressions produces negatively biased coefficient estimates, even if the LDV is part of the data-generating process.

Why do we create lagged variables?

When the current value of your dependent variable depends on its past value(s), you add those past values as explanatory variables. For example, how much dividend was paid in the previous year affects the current year's dividend decision; that's why we add lagged values to the model.

What are lagged variables?

A dependent variable that is lagged in time. For example, if Y_t is the dependent variable, then Y_{t-1} is a lagged dependent variable with a lag of one period. Lagged values are used in dynamic regression modeling.


2 Answers

Have a look at e.g. the dynlm package which gives you lag operators. More generally the Task Views on Econometrics and Time Series will have lots more for you to look at.

Here is the beginning of its examples -- a one and twelve month lag:

R> data("UKDriverDeaths", package = "datasets")
R> uk <- log10(UKDriverDeaths)
R> dfm <- dynlm(uk ~ L(uk, 1) + L(uk, 12))
R> dfm

Time series regression with "ts" data:
Start = 1970(1), End = 1984(12)

Call:
dynlm(formula = uk ~ L(uk, 1) + L(uk, 12))

Coefficients:
(Intercept)     L(uk, 1)    L(uk, 12)
      0.183        0.431        0.511

R>
Dirk Eddelbuettel, answered Sep 19 '22


Following Dirk's suggestion of dynlm, I couldn't quite figure out how to predict with it, but searching for that led me to the dyn package via https://stats.stackexchange.com/questions/6758/1-step-ahead-predictions-with-dynlm-r-package

Then, after several hours of experimentation, I came up with the following function to handle the prediction. There were quite a few gotchas along the way: for example, you can't seem to rbind time series, the result of predict is offset by start, and a whole bunch of things like that. So I feel this answer adds significantly compared to just naming a package, though I have upvoted Dirk's answer.

So, a solution that works is:

  • use the dyn package
  • use the following method for prediction

predictDyn method:

# pass in training data and test data;
# it will step through the test rows one by one.
# Needs the dependent variable's name, so that it can make that column into a time series.
predictDyn <- function( model, train, test, dependentvarname ) {
    Ntrain <- nrow(train)
    Ntest <- nrow(test)
    # can't rbind ts's apparently, so convert to numeric first
    train[,dependentvarname] <- as.numeric(train[,dependentvarname])
    test[,dependentvarname] <- as.numeric(test[,dependentvarname])
    testtraindata <- rbind( train, test )
    testtraindata[,dependentvarname] <- ts( as.numeric( testtraindata[,dependentvarname] ) )
    for( i in 1:Ntest ) {
        result <- predict(model,newdata=testtraindata,subset=1:(Ntrain+i-1))
        testtraindata[Ntrain+i,dependentvarname] <- result[Ntrain + i + 1 - start(result)][1]
    }
    return( testtraindata[(Ntrain+1):(Ntrain + Ntest),] )
}

Example usage:

library("dyn")

# size of training and test data
N <- 6
predictN <- 10

# create training data, which we can get an exact fit on, so we can check the results easily
traindata <- c(1,2)
for( i in 3:N ) {
    traindata[i] <- 0.5 + 1.3 * traindata[i-2] + 1.7 * traindata[i-1]
}
train <- data.frame( y = ts( traindata ), foo = 1)

# create testing data: a bunch of NAs
test <- data.frame( y = ts( rep(NA,predictN) ), foo = 1)

# fit a model
model <- dyn$lm( y ~ lag(y,-1) + lag(y,-2), train )
# look at the model; it's a perfect fit. Nice!
print(model)

test <- predictDyn( model, train, test, "y" )
print(test)

# nice plot
plot(test$y, type='l')

Output:

> model

Call:
lm(formula = dyn(y ~ lag(y, -1) + lag(y, -2)), data = train)

Coefficients:
(Intercept)   lag(y, -1)   lag(y, -2)
        0.5          1.7          1.3

> test
             y foo
7     143.2054   1
8     325.6810   1
9     740.3247   1
10   1682.4373   1
11   3823.0656   1
12   8686.8801   1
13  19738.1816   1
14  44848.3528   1
15 101902.3358   1
16 231537.3296   1

Edit: hmmm, this is super slow though. Even if I limit the data in the subset to a constant few rows of the dataset, it takes about 24 milliseconds per prediction, or, for my task, 0.024*7*24*8*20*10/60/60 = 1.792 hours :-O
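One way to sidestep the per-step predict() overhead (a sketch under the assumption of the same AR(2) model fitted above, not part of the original answer) is to extract the coefficients once and run the recursion with plain arithmetic:

```r
# Recurse directly on the fitted coefficients of y ~ lag(y,-1) + lag(y,-2),
# avoiding a predict() call per step. Coefficients taken from print(model) above.
coefs <- c(intercept = 0.5, l1 = 1.7, l2 = 1.3)

fast_forecast <- function(coefs, y_last, y_prev, n_ahead) {
  out <- numeric(n_ahead)
  for (i in seq_len(n_ahead)) {
    # y[t] = intercept + l1 * y[t-1] + l2 * y[t-2]
    out[i] <- coefs["intercept"] + coefs["l1"] * y_last + coefs["l2"] * y_prev
    y_prev <- y_last
    y_last <- out[i]
  }
  out
}

# last two training values from the example above: y5 = 27.558, y6 = 62.8706
preds <- fast_forecast(coefs, y_last = 62.8706, y_prev = 27.558, n_ahead = 10)
preds[1]   # matches row 7 of the predictDyn output: 143.2054
```

This loses the generality of the formula interface, but for a fixed model form it reduces each step to a handful of multiplications.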

Hugh Perkins, answered Sep 20 '22