Roll join gives NA's in data.table

Tags: r, data.table

Sample data:

Usage = data.table(
  feature = 'M11', 
  startDate = structure(rep(c(17130, 17155), c(4, 3)), class = "Date"), 
  cc = 'X6', vendor = 'Z1'
)
Limits = data.table(
  vendorId = 'Z1',
  featureId = 'M11', 
  costcenter ='X6', oldLimit = 1:6, 
  date = structure(17044 + c(91, 61, 30, 0, 105, 75), class = "Date")
)

I am trying to add a limit column to the Usage data.table by looking it up in the Limits data.table. The goal is to find the limit that applied to each feature, cost center, and vendor combination at the time of the corresponding usage.

However, when I roll-join with the code below, I get strange results. I get a lot of NAs on my real data, so I created the sample data above. Here is my roll-join code.

Usage[Limits, limitAtStartDate:= i.oldLimit,   
      on = c(cc="costcenter", feature="featureId",
             vendor="vendorId", startDate="date" ), 
      roll=TRUE, verbose=TRUE][] 
#    feature  startDate cc vendor limitAtStartDate
# 1:     M11 2016-11-25 X6     Z1                6
# 2:     M11 2016-11-25 X6     Z1               NA
# 3:     M11 2016-11-25 X6     Z1               NA
# 4:     M11 2016-11-25 X6     Z1               NA
# 5:     M11 2016-12-20 X6     Z1                5
# 6:     M11 2016-12-20 X6     Z1               NA
# 7:     M11 2016-12-20 X6     Z1               NA

Why are 5 and 6 set for only one record each in limitAtStartDate?

I expect 5 for all rows with date 2016-12-20 and 6 for all rows with date 2016-11-25. Please let me know where I am going wrong. I am using data.table version 1.10.0.

pauljeba asked Feb 09 '17 08:02


1 Answer

When you perform an X[Y] join in data.table, you are looking up, for each row of Y, the matching row in X. Hence the result has as many rows as Y. In your case, you want to find a value in Limits for each row of Usage and get a vector of length 7, whereas Usage[Limits] looks up each row of Limits in Usage. So you should probably join the other way around (Limits[Usage]) and then store the result back into Usage:

Limits[Usage, 
       oldLimit, 
       on = .(costcenter = cc, featureId = feature, vendorId = vendor, date = startDate),
       roll = TRUE]
## [1] 6 6 6 6 5 5 5
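
To store that vector back into Usage, one option is to assign it by reference with `:=`. The following is a minimal sketch that rebuilds the sample data from the question. The `on` mapping is written here as a named character vector (Limits column = Usage column), and the sketch assumes each Usage row matches at most one Limits row, so the join returns exactly `nrow(Usage)` values.

```r
library(data.table)

# Sample data from the question
Usage <- data.table(
  feature = "M11",
  startDate = structure(rep(c(17130, 17155), c(4, 3)), class = "Date"),
  cc = "X6", vendor = "Z1"
)
Limits <- data.table(
  vendorId = "Z1", featureId = "M11", costcenter = "X6",
  oldLimit = 1:6,
  date = structure(17044 + c(91, 61, 30, 0, 105, 75), class = "Date")
)

# For each Usage row, roll the most recent Limits row (date <= startDate)
# forward, then assign the resulting vector back into Usage by reference.
Usage[, limitAtStartDate := Limits[
  Usage, oldLimit,
  on = c(costcenter = "cc", featureId = "feature",
         vendorId = "vendor", date = "startDate"),
  roll = TRUE
]]

Usage$limitAtStartDate
## [1] 6 6 6 6 5 5 5
```

If several Limits rows could match a single Usage row exactly, the inner join would return more than one value per row and the assignment would no longer line up, so it may be worth checking that Limits is unique on the join columns first.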

As a side note, for very simple (and sometimes not so simple) cases you could just use findInterval.

setorder(Limits, date)[findInterval(Usage$startDate, date), oldLimit]
## [1] 6 6 6 6 5 5 5

It is a very efficient function, but it has some caveats:

  • You need to sort the intervals vector first.
  • You can't limit the rolling window as easily as you can in data.table (e.g. roll = 2 instead of just roll = TRUE).
  • Probably the biggest disadvantage: it is tricky to perform a rolling join on several variables at once (without involving loops), which is easy to do with data.table.
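
The sorting caveat, plus one edge case the list does not mention, can be sketched in base R. This is only an illustration on the question's dates written as plain day numbers. The names `limit_days`, `old_limit`, `usage_days`, and `limit_at_start` are made up here, and the extra usage day 17000 is a hypothetical value added to show the edge case. `findInterval` returns 0 for a value that precedes every interval start, and indexing with 0 silently drops that element, so the sketch maps 0 to NA before the lookup.

```r
# Limit dates and usage dates from the question, as day numbers
limit_days <- 17044 + c(91, 61, 30, 0, 105, 75)
old_limit  <- 1:6
# 17000 is a hypothetical usage day that precedes every limit
usage_days <- c(17000, rep(c(17130, 17155), c(4, 3)))

# findInterval needs its second argument sorted in non-decreasing order
ord <- order(limit_days)
idx <- findInterval(usage_days, limit_days[ord])

# An index of 0 means "before the first limit": turn it into NA so the
# result keeps one element per usage row
idx[idx == 0L] <- NA
limit_at_start <- old_limit[ord][idx]
limit_at_start
## [1] NA  6  6  6  6  5  5  5
```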
David Arenburg answered Nov 15 '22 04:11