Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to find the last or next entry using R package data.table and rolling joins

Tags:

r

data.table

Lets say I have a data table like this.

   customer_id time_stamp value
1:           1        223     4
2:           1        252     1
3:           1        456     3
4:           2        455     5
5:           2        632     2

So that customer_id and time_stamp together form a unique key. I want to add some new columns indicating the previous and last values of "value". That is, I want output like this.

  customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

I want this to be fast and work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However the rolling join appears to find the last time OR same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), this doesn't quite work. You can fudge it by adding a tiny number to the time variable of the copy but this is kinda awkward.

Is there a way to do this simply with rollin join or some other data.table method? I've found an efficient way but it still requires about 40 lines of R code. It seems that this could be a one-liner if rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.

Here is the example data.

data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]

This is the code I wrote. Note that the lines putting NA into the ones where customer_id differ throws a warning and probably needs changing. I have them commented out below. Anyone have any suggestions for replacing those two lines?

add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
                   next_tag="NEXT",sep="_"){
  o=order(data[[ident]],data[[timecol]])
  uo=order(o)
  data=data[o,]
  Nrow=nrow(data)
  Ncol=ncol(data)
  #shift it, put any junk in the first row
  data_prev=data[c(1,1:(Nrow-1)),]
  #shift it, put any junk in the last row
  data_next=data[c(2:(Nrow),Nrow),]
  #flag the rows where the identity changes, these get NA
  prev_diff=data[[ident]] != data_prev[[ident]]
  prev_diff[1]=T
  next_diff=data[[ident]] != data_next[[ident]]  
  next_diff[Nrow]=T
  #change names
  names=names(data)
  names_prev=paste(names,prev_tag,sep=sep)
  names_next=paste(names,next_tag,sep=sep)
  setnames(data_prev,names,names_prev)
  setnames(data_next,names,names_next)
  #put NA in rows where prev and next are from a different ident
  #replace the next two lines with something else
  #data_prev[prev_diff,]<-NA
  #data_next[next_diff,]<-NA
  data_all=cbind(data,data_prev,data_next)
  data_all=data_all[uo,]
  return(data_all)
}
like image 613
user1827975 Avatar asked Mar 20 '13 20:03

user1827975


People also ask

What is a rolling join in R?

Rolling join, known also as last observation carried forward (LOCF), is an inequality join of two tables.

How do you join data tables in R?

If you want to join by multiple variables, then you need to specify a vector of variable names: by = c("var1", "var2", "var3") . Here all three columns must match in both tables. If you want to use all variables that appear in both tables, then you can leave the by argument blank.

When I is a data table or character vector the columns to join by must be specified using?

table (or character vector), the columns to join by must be specified using 'on=' argument (see ? data. table), by keying x (i.e. sorted, and, marked as sorted, see ? setkey), or by sharing column names between x and i (i.e., a natural join).


2 Answers

Update: #965 is now implemented in 1.9.5. From NEWS:

  1. New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

Now we can therefore do:

dt[, c("value_PREV", "value_NEXT") := c(shift(value, 1L, type="lag"), 
                     shift(value, 1L, type="lead")), by=customer_id]

You don't need a roll join here at all. you can do this with head and tail. Assuming your data.table is DT:

setkey(DT, "customer_id")
DT[, list(time_stamp = time_stamp, 
          prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
by=customer_id]
#   customer_id time_stamp prev.val next.val
# 1:           1        223       NA        1
# 2:           1        252        4        3
# 3:           1        456        1       NA
# 4:           2        455       NA        2
# 5:           2        632        5       NA

Edit: Even better:

DT[, `:=`(prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
          by=customer_id]
like image 194
Arun Avatar answered Nov 15 '22 13:11

Arun


Yes if I don't want roll to equimatch then I also take a little bit off if it's type double, or work with integer and add or subtract 1L.

DT = data.table( customer_id=c(1,2,1,1,2), 
                 time_stamp=as.integer(c(252,632,456,223,455)),
                 value=c(1,2,3,4,5))
setkey(DT, customer_id, time_stamp)
DT[ DT[,list(customer_id,time_stamp+1L,value)], value_PREV:=i.value, roll=-Inf]
DT[ DT[,list(customer_id,time_stamp-1L,value)], value_NEXT:=i.value, roll=+Inf]
DT
   customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

To have to take a column subset of DT again in i like that is a bit awkward, I agree.

Have now filed FR#2628 to add a new argument rollequal=TRUE|FALSE. Then it would be :

setkey(DT, customer_id, time_stamp)
DT[ DT, value_PREV:=i.value, roll=-Inf, rollequal=FALSE]
DT[ DT, value_NEXT:=i.value, roll=+Inf, rollequal=FALSE]

That would be faster too by avoiding the copy of the i columns and not needing to allocate for time_stamp-1L and time_stamp+1L.

But in this case, it's a self join from DT to DT and DT's key is unique, so as Arun says, a roll join isn't needed. Maybe a fast shift or lag function is needed to avoid the overhead of c() and head() or tail(), for speed.

Thanks for highlighting!

like image 38
Matt Dowle Avatar answered Nov 15 '22 14:11

Matt Dowle