How to find the last or next entry using R package data.table and rolling joins

Tags:

data.table

Lets say I have a data table like this.

   customer_id time_stamp value
1:           1        223     4
2:           1        252     1
3:           1        456     3
4:           2        455     5
5:           2        632     2

So that customer_id and time_stamp together form a unique key. I want to add some new columns indicating the previous and last values of "value". That is, I want output like this.

  customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

I want this to be fast and work with sparse, irregular times. I thought that the data.table rolling join would do it for me. However the rolling join appears to find the last time OR same time. So if you do a rolling join on two copies of the same table (after adding _PREV to the column names of the copy), this doesn't quite work. You can fudge it by adding a tiny number to the time variable of the copy but this is kinda awkward.

Is there a way to do this simply with rollin join or some other data.table method? I've found an efficient way but it still requires about 40 lines of R code. It seems that this could be a one-liner if rolling join could be told to look for the last time NOT including the same time. Or maybe there is some other neat trick.

Here is the example data.

data=data.table(customer_id=c(1,2,1,1,2),time_stamp=c(252,632,456,223,455),value=c(1,2,3,4,5))
data_sorted=data[order(customer_id,time_stamp)]

This is the code I wrote. Note that the lines putting NA into the ones where customer_id differ throws a warning and probably needs changing. I have them commented out below. Anyone have any suggestions for replacing those two lines?

add_prev_next_cbind<-function(data,ident="customer_id",timecol="time_stamp",prev_tag="PREV",
                   next_tag="NEXT",sep="_"){
  o=order(data[[ident]],data[[timecol]])
  uo=order(o)
  data=data[o,]
  Nrow=nrow(data)
  Ncol=ncol(data)
  #shift it, put any junk in the first row
  data_prev=data[c(1,1:(Nrow-1)),]
  #shift it, put any junk in the last row
  data_next=data[c(2:(Nrow),Nrow),]
  #flag the rows where the identity changes, these get NA
  prev_diff=data[[ident]] != data_prev[[ident]]
  prev_diff[1]=T
  next_diff=data[[ident]] != data_next[[ident]]  
  next_diff[Nrow]=T
  #change names
  names=names(data)
  names_prev=paste(names,prev_tag,sep=sep)
  names_next=paste(names,next_tag,sep=sep)
  setnames(data_prev,names,names_prev)
  setnames(data_next,names,names_next)
  #put NA in rows where prev and next are from a different ident
  #replace the next two lines with something else
  #data_prev[prev_diff,]<-NA
  #data_next[next_diff,]<-NA
  data_all=cbind(data,data_prev,data_next)
  data_all=data_all[uo,]
  return(data_all)
}

613

asked Mar 20 '13 20:03

user1827975

2 Answers

Update: #965 is now implemented in 1.9.5. From NEWS:

New function shift() implements fast lead/lag of vector, list, data.frames or data.tables. It takes a type argument which can be either "lag" (default) or "lead" and always returns a list, which makes it very convenient to use it along with := or set(). For example: DT[, (cols) := shift(.SD, 1L), by=id]. Please have a look at ?shift for more info.

Now we can therefore do:

dt[, c("value_PREV", "value_NEXT") := c(shift(value, 1L, type="lag"), 
                     shift(value, 1L, type="lead")), by=customer_id]

You don't need a roll join here at all. you can do this with head and tail. Assuming your data.table is DT:

setkey(DT, "customer_id")
DT[, list(time_stamp = time_stamp, 
          prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
by=customer_id]
#   customer_id time_stamp prev.val next.val
# 1:           1        223       NA        1
# 2:           1        252        4        3
# 3:           1        456        1       NA
# 4:           2        455       NA        2
# 5:           2        632        5       NA

Edit: Even better:

DT[, `:=`(prev.val = c(NA, head(value, -1)), 
          next.val = c(tail(value, -1), NA)), 
          by=customer_id]

194

answered Nov 15 '22 13:11

Arun

Yes if I don't want roll to equimatch then I also take a little bit off if it's type double, or work with integer and add or subtract 1L.

DT = data.table( customer_id=c(1,2,1,1,2), 
                 time_stamp=as.integer(c(252,632,456,223,455)),
                 value=c(1,2,3,4,5))
setkey(DT, customer_id, time_stamp)
DT[ DT[,list(customer_id,time_stamp+1L,value)], value_PREV:=i.value, roll=-Inf]
DT[ DT[,list(customer_id,time_stamp-1L,value)], value_NEXT:=i.value, roll=+Inf]
DT
   customer_id time_stamp value value_PREV value_NEXT
1:           1        223     4         NA          1
2:           1        252     1          4          3
3:           1        456     3          1         NA
4:           2        455     5         NA          2
5:           2        632     2          5         NA

To have to take a column subset of DT again in i like that is a bit awkward, I agree.

Have now filed FR#2628 to add a new argument rollequal=TRUE|FALSE. Then it would be :

setkey(DT, customer_id, time_stamp)
DT[ DT, value_PREV:=i.value, roll=-Inf, rollequal=FALSE]
DT[ DT, value_NEXT:=i.value, roll=+Inf, rollequal=FALSE]

That would be faster too by avoiding the copy of the i columns and not needing to allocate for time_stamp-1L and time_stamp+1L.

But in this case, it's a self join from DT to DT and DT's key is unique, so as Arun says, a roll join isn't needed. Maybe a fast shift or lag function is needed to avoid the overhead of c() and head() or tail(), for speed.

Thanks for highlighting!

answered Nov 15 '22 14:11

Matt Dowle

Related questions
                            
                                Elegant way to solve ddply task with aggregate (hoping for better performance)
                            
                                Dates and Times in separate columns, convert to datetime in R
                            
                                Pie plot getting its text on top of each other
                            
                                R: Using position information of elements when looping through a vector.
                            
                                Multiple definition of a function in C/C++ code
                            
                                Solidifying a melted data frame?
                            
                                In R, how to make dots transparent while using scale_size in ggplot2?
                            
                                plot only a select few facets in facet_grid
                            
                                Calculating the number of positive and negative runs in a vector
                            
                                Insert function variable into graph title
                            
                                Most efficient way to select ending item of an array?
                            
                                ROC curve for classification from randomForest
                            
                                Is this the solution to get data from EUROSTAT into R?
                            
                                Adding a matrix to a plot?
                            
                                Combining `expression()` with `\n`
                            
                                How to change font size of plot title when the title is a variable in ggplot2?
                            
                                change the font and colour of the title of the igraph
                            
                                ggplot version of charts.PerformanceSummary
                            
                                How to draw a border around a barplot in R the same way a border is drawn for a boxplot
                            
                                Accurately converting from character->POSIXct->character with sub millisecond datetimes

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With