Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to create a lag variable within each group?

I have a data.table:

set.seed(1) data <- data.table(time = c(1:3, 1:4),                    groups = c(rep(c("b", "a"), c(3, 4))),                    value = rnorm(7))  data #    groups time      value # 1:      b    1 -0.6264538 # 2:      b    2  0.1836433 # 3:      b    3 -0.8356286 # 4:      a    1  1.5952808 # 5:      a    2  0.3295078 # 6:      a    3 -0.8204684 # 7:      a    4  0.4874291 

I want to compute a lagged version of the "value" column, within each level of "groups".

The result should look like

#   groups time      value  lag.value # 1      a    1  1.5952808         NA # 2      a    2  0.3295078  1.5952808 # 3      a    3 -0.8204684  0.3295078 # 4      a    4  0.4874291 -0.8204684 # 5      b    1 -0.6264538         NA # 6      b    2  0.1836433 -0.6264538 # 7      b    3 -0.8356286  0.1836433 

I have tried to use lag directly:

data$lag.value <- lag(data$value)  

...which clearly wouldn't work.

I have also tried:

unlist(tapply(data$value, data$groups, lag))  a1         a2         a3         a4         b1         b2         b3   NA -0.1162932  0.4420753  2.1505440         NA  0.5894583 -0.2890288  

Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.

What is the most efficient way to do this in base R, plyr, dplyr, and data.table?

like image 291
xiaodai Avatar asked Oct 10 '14 04:10

xiaodai


People also ask

How do you lag variables in SAS?

You can use the LAG function in SAS to retrieve lagged values of some variable. This function uses the following basic syntax: lag1_value = lag(value); By default, lag finds the previous value of some variable.


2 Answers

You could do this within data.table

 library(data.table)  data[, lag.value:=c(NA, value[-.N]), by=groups]   data  #   time groups       value   lag.value  #1:    1      a  0.02779005          NA  #2:    2      a  0.88029938  0.02779005  #3:    3      a -1.69514201  0.88029938  #4:    1      b -1.27560288          NA  #5:    2      b -0.65976434 -1.27560288  #6:    3      b -1.37804943 -0.65976434  #7:    4      b  0.12041778 -1.37804943 

For multiple columns:

nm1 <- grep("^value", colnames(data), value=TRUE) nm2 <- paste("lag", nm1, sep=".") data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1]  data #    time groups      value     value1      value2  lag.value lag.value1 #1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247 #3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814 #4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA #5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812 #6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432 #7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406 #    lag.value2 #1:          NA #2:  1.12493092 #3: -0.04493361 #4:          NA #5:  0.94383621 #6:  0.82122120 #7:  0.59390132 

Update

From data.table versions >= v1.9.5, we can use shift with type as lag or lead. By default, the type is lag.

data[, (nm2) :=  shift(.SD), by=groups, .SDcols=nm1] #   time groups      value     value1      value2  lag.value lag.value1 #1:    1      b -0.6264538  0.7383247  1.12493092         NA         NA #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.6264538  0.7383247 #3:    3      b -0.8356286 -0.3053884 -0.01619026  0.1836433  0.5757814 #4:    1      a  1.5952808  1.5117812  0.94383621         NA         NA #5:    2      a  0.3295078  0.3898432  0.82122120  1.5952808  1.5117812 #6:    3      a -0.8204684 -0.6212406  0.59390132  0.3295078  0.3898432 #7:    4      a  0.4874291 -2.2146999  0.91897737 -0.8204684 -0.6212406 #    lag.value2 #1:          NA #2:  1.12493092 #3: -0.04493361 #4:          NA #5:  0.94383621 #6:  0.82122120 #7:  0.59390132 

If you need the reverse, use type=lead

nm3 <- paste("lead", nm1, sep=".") 

Using the original dataset

  data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1]   #  time groups      value     value1      value2 lead.value lead.value1   #1:    1      b -0.6264538  0.7383247  1.12493092  0.1836433   0.5757814   #2:    2      b  0.1836433  0.5757814 -0.04493361 -0.8356286  -0.3053884   #3:    3      b -0.8356286 -0.3053884 -0.01619026         NA          NA   #4:    1      a  1.5952808  1.5117812  0.94383621  0.3295078   0.3898432   #5:    2      a  0.3295078  0.3898432  0.82122120 -0.8204684  -0.6212406   #6:    3      a -0.8204684 -0.6212406  0.59390132  0.4874291  -2.2146999   #7:    4      a  0.4874291 -2.2146999  0.91897737         NA          NA  #   lead.value2  #1: -0.04493361  #2: -0.01619026  #3:          NA  #4:  0.82122120  #5:  0.59390132  #6:  0.91897737  #7:          NA 

data

 set.seed(1)  data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))),              value = rnorm(7), value1=rnorm(7), value2=rnorm(7)) 
like image 148
akrun Avatar answered Nov 06 '22 04:11

akrun


Using package dplyr:

library(dplyr) data <-      data %>%     group_by(groups) %>%     mutate(lag.value = dplyr::lag(value, n = 1, default = NA)) 

gives

> data Source: local data table [7 x 4] Groups: groups    time groups       value   lag.value 1    1      a  0.07614866          NA 2    2      a -0.02784712  0.07614866 3    3      a  1.88612245 -0.02784712 4    1      b  0.26526825          NA 5    2      b  1.23820506  0.26526825 6    3      b  0.09276648  1.23820506 7    4      b -0.09253594  0.09276648 

As noted by @BrianD, this implicitly assumes that value is sorted by group already. If not, either sort it by group, or use the order_by argument in lag. Also note that due to an existing issue with some versions of dplyr, for safety, arguments and the namespace should be explicitly given.

like image 33
Alex Avatar answered Nov 06 '22 04:11

Alex