I have a data.table:
set.seed(1) data <- data.table(time = c(1:3, 1:4), groups = c(rep(c("b", "a"), c(3, 4))), value = rnorm(7)) data # groups time value # 1: b 1 -0.6264538 # 2: b 2 0.1836433 # 3: b 3 -0.8356286 # 4: a 1 1.5952808 # 5: a 2 0.3295078 # 6: a 3 -0.8204684 # 7: a 4 0.4874291
I want to compute a lagged version of the "value" column, within each level of "groups".
The result should look like
# groups time value lag.value # 1 a 1 1.5952808 NA # 2 a 2 0.3295078 1.5952808 # 3 a 3 -0.8204684 0.3295078 # 4 a 4 0.4874291 -0.8204684 # 5 b 1 -0.6264538 NA # 6 b 2 0.1836433 -0.6264538 # 7 b 3 -0.8356286 0.1836433
I have tried to use lag
directly:
data$lag.value <- lag(data$value)
...which clearly wouldn't work.
I have also tried:
unlist(tapply(data$value, data$groups, lag)) a1 a2 a3 a4 b1 b2 b3 NA -0.1162932 0.4420753 2.1505440 NA 0.5894583 -0.2890288
Which is almost what I want. However the vector generated is ordered differently from the ordering in the data.table which is problematic.
What is the most efficient way to do this in base R, plyr, dplyr, and data.table?
You can use the LAG function in SAS to retrieve lagged values of some variable. This function uses the following basic syntax: lag1_value = lag(value); By default, lag finds the previous value of some variable.
You could do this within data.table
library(data.table) data[, lag.value:=c(NA, value[-.N]), by=groups] data # time groups value lag.value #1: 1 a 0.02779005 NA #2: 2 a 0.88029938 0.02779005 #3: 3 a -1.69514201 0.88029938 #4: 1 b -1.27560288 NA #5: 2 b -0.65976434 -1.27560288 #6: 3 b -1.37804943 -0.65976434 #7: 4 b 0.12041778 -1.37804943
For multiple columns:
nm1 <- grep("^value", colnames(data), value=TRUE) nm2 <- paste("lag", nm1, sep=".") data[, (nm2):=lapply(.SD, function(x) c(NA, x[-.N])), by=groups, .SDcols=nm1] data # time groups value value1 value2 lag.value lag.value1 #1: 1 b -0.6264538 0.7383247 1.12493092 NA NA #2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247 #3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814 #4: 1 a 1.5952808 1.5117812 0.94383621 NA NA #5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812 #6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432 #7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406 # lag.value2 #1: NA #2: 1.12493092 #3: -0.04493361 #4: NA #5: 0.94383621 #6: 0.82122120 #7: 0.59390132
From data.table
versions >= v1.9.5
, we can use shift
with type
as lag
or lead
. By default, the type is lag
.
data[, (nm2) := shift(.SD), by=groups, .SDcols=nm1] # time groups value value1 value2 lag.value lag.value1 #1: 1 b -0.6264538 0.7383247 1.12493092 NA NA #2: 2 b 0.1836433 0.5757814 -0.04493361 -0.6264538 0.7383247 #3: 3 b -0.8356286 -0.3053884 -0.01619026 0.1836433 0.5757814 #4: 1 a 1.5952808 1.5117812 0.94383621 NA NA #5: 2 a 0.3295078 0.3898432 0.82122120 1.5952808 1.5117812 #6: 3 a -0.8204684 -0.6212406 0.59390132 0.3295078 0.3898432 #7: 4 a 0.4874291 -2.2146999 0.91897737 -0.8204684 -0.6212406 # lag.value2 #1: NA #2: 1.12493092 #3: -0.04493361 #4: NA #5: 0.94383621 #6: 0.82122120 #7: 0.59390132
If you need the reverse, use type=lead
nm3 <- paste("lead", nm1, sep=".")
Using the original dataset
data[, (nm3) := shift(.SD, type='lead'), by = groups, .SDcols=nm1] # time groups value value1 value2 lead.value lead.value1 #1: 1 b -0.6264538 0.7383247 1.12493092 0.1836433 0.5757814 #2: 2 b 0.1836433 0.5757814 -0.04493361 -0.8356286 -0.3053884 #3: 3 b -0.8356286 -0.3053884 -0.01619026 NA NA #4: 1 a 1.5952808 1.5117812 0.94383621 0.3295078 0.3898432 #5: 2 a 0.3295078 0.3898432 0.82122120 -0.8204684 -0.6212406 #6: 3 a -0.8204684 -0.6212406 0.59390132 0.4874291 -2.2146999 #7: 4 a 0.4874291 -2.2146999 0.91897737 NA NA # lead.value2 #1: -0.04493361 #2: -0.01619026 #3: NA #4: 0.82122120 #5: 0.59390132 #6: 0.91897737 #7: NA
set.seed(1) data <- data.table(time =c(1:3,1:4),groups = c(rep(c("b","a"),c(3,4))), value = rnorm(7), value1=rnorm(7), value2=rnorm(7))
Using package dplyr
:
library(dplyr) data <- data %>% group_by(groups) %>% mutate(lag.value = dplyr::lag(value, n = 1, default = NA))
gives
> data Source: local data table [7 x 4] Groups: groups time groups value lag.value 1 1 a 0.07614866 NA 2 2 a -0.02784712 0.07614866 3 3 a 1.88612245 -0.02784712 4 1 b 0.26526825 NA 5 2 b 1.23820506 0.26526825 6 3 b 0.09276648 1.23820506 7 4 b -0.09253594 0.09276648
As noted by @BrianD, this implicitly assumes that value is sorted by group already. If not, either sort it by group, or use the order_by
argument in lag
. Also note that due to an existing issue with some versions of dplyr, for safety, arguments and the namespace should be explicitly given.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With