Lets say I have a data table like this.
customer_id time_stamp value
1: 1 223 4
2: 1 252 1
3: 1 456 3
4: 2 455 5
5: 2 632 2
So that customer_id and time_stamp together form a unique key. I want to add some new columns indicating the previous and last values of "value". That is, I want output like this.
customer_id time_stamp value value_PREV value_NEXT
1: 1 223 4 NA 1
2: 1 252 1 4 3
3: 1 456 3 1 NA
4: 2 455 5 NA 2
5: 2 632 2 5 NA
I want this to be fast and work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However the rolling join appears to find the last time OR same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), this doesn't quite work. You can fudge it by adding a tiny number to the time variable of the copy but this is kinda awkward.
Is there a way to do this simply with rollin join or some other data.table method? I've found an efficient way but it still requires about 40 lines of R code. It seems that this could be a one-liner if rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.
Here is the example data.
data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]
This is the code I wrote. Note that the lines putting NA into the ones where customer_id differ throws a warning and probably needs changing. I have them commented out below. Anyone have any suggestions for replacing those two lines?
add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
next_tag="NEXT",sep="_"){
o=order(data[[ident]],data[[timecol]])
uo=order(o)
data=data[o,]
Nrow=nrow(data)
Ncol=ncol(data)
#shift it, put any junk in the first row
data_prev=data[c(1,1:(Nrow-1)),]
#shift it, put any junk in the last row
data_next=data[c(2:(Nrow),Nrow),]
#flag the rows where the identity changes, these get NA
prev_diff=data[[ident]] != data_prev[[ident]]
prev_diff[1]=T
next_diff=data[[ident]] != data_next[[ident]]
next_diff[Nrow]=T
#change names
names=names(data)
names_prev=paste(names,prev_tag,sep=sep)
names_next=paste(names,next_tag,sep=sep)
setnames(data_prev,names,names_prev)
setnames(data_next,names,names_next)
#put NA in rows where prev and next are from a different ident
#replace the next two lines with something else
#data_prev[prev_diff,]<-NA
#data_next[next_diff,]<-NA
data_all=cbind(data,data_prev,data_next)
data_all=data_all[uo,]
return(data_all)
}
Rolling join, known also as last observation carried forward (LOCF), is an inequality join of two tables.
If you want to join by multiple variables, then you need to specify a vector of variable names: by = c("var1", "var2", "var3") . Here all three columns must match in both tables. If you want to use all variables that appear in both tables, then you can leave the by argument blank.
table (or character vector), the columns to join by must be specified using 'on=' argument (see ? data. table), by keying x (i.e. sorted, and, marked as sorted, see ? setkey), or by sharing column names between x and i (i.e., a natural join).
- New function
shift()
implements fastlead/lag
of vector, list, data.frames or data.tables. It takes atype
argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with:=
orset()
. For example:DT[, (cols) := shift(.SD, 1L), by=id]
. Please have a look at?shift
for more info.
Now we can therefore do:
dt[, c("value_PREV", "value_NEXT") := c(shift(value, 1L, type="lag"),
shift(value, 1L, type="lead")), by=customer_id]
You don't need a roll join here at all. you can do this with head
and tail
. Assuming your data.table
is DT:
setkey(DT, "customer_id")
DT[, list(time_stamp = time_stamp,
prev.val = c(NA, head(value, -1)),
next.val = c(tail(value, -1), NA)),
by=customer_id]
# customer_id time_stamp prev.val next.val
# 1: 1 223 NA 1
# 2: 1 252 4 3
# 3: 1 456 1 NA
# 4: 2 455 NA 2
# 5: 2 632 5 NA
Edit: Even better:
DT[, `:=`(prev.val = c(NA, head(value, -1)),
next.val = c(tail(value, -1), NA)),
by=customer_id]
Yes if I don't want roll
to equimatch then I also take a little bit off if it's type double, or work with integer and add or subtract 1L.
DT = data.table( customer_id=c(1,2,1,1,2),
time_stamp=as.integer(c(252,632,456,223,455)),
value=c(1,2,3,4,5))
setkey(DT, customer_id, time_stamp)
DT[ DT[,list(customer_id,time_stamp+1L,value)], value_PREV:=i.value, roll=-Inf]
DT[ DT[,list(customer_id,time_stamp-1L,value)], value_NEXT:=i.value, roll=+Inf]
DT
customer_id time_stamp value value_PREV value_NEXT
1: 1 223 4 NA 1
2: 1 252 1 4 3
3: 1 456 3 1 NA
4: 2 455 5 NA 2
5: 2 632 2 5 NA
To have to take a column subset of DT
again in i
like that is a bit awkward, I agree.
Have now filed FR#2628 to add a new argument rollequal=TRUE|FALSE
. Then it would be :
setkey(DT, customer_id, time_stamp)
DT[ DT, value_PREV:=i.value, roll=-Inf, rollequal=FALSE]
DT[ DT, value_NEXT:=i.value, roll=+Inf, rollequal=FALSE]
That would be faster too by avoiding the copy of the i
columns and not needing to allocate for time_stamp-1L
and time_stamp+1L
.
But in this case, it's a self join from DT
to DT
and DT
's key is unique, so as Arun says, a roll
join isn't needed. Maybe a fast shift or lag function is needed to avoid the overhead of c()
and head()
or tail()
, for speed.
Thanks for highlighting!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With