
Conditional binary join and update by reference using the data.table package

Tags: join, r, data.table

Here is a real-life problem that I feel can be easily solved, but I'm missing something obvious. I have two big data sets called TK and DFT:

library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2), 
                  DFT_id = rep(1:3, 2), 
                  New_id = sample(1e4, 6),
                  key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        2      1   8828
# 3:        1      2   7883
# 4:        3      2   9401
# 5:        2      3   4089
# 6:        3      3    456

(DFT <- data.table(venue_id = rep(1:2, each = 2), 
                   DFT_id = 1:4, 
                   New_id = sample(4),
                   key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        1      2      4
# 3:        2      3      2
# 4:        2      4      1

I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be

TK
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I was thinking to combine both conditions, but it didn't work (still not sure why)

TK[DFT & venue_id %in% 1:2, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) : 
#   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.
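
A possible explanation, sketched outside of `[`: `&` between a logical vector and a data.table is handled by the data.frame operator method, which applies the operator to every column and returns a logical matrix. No join takes place, so `i` ends up being a matrix. The values below are hard-coded from the tables above, and the name `m` is only for illustration.

```r
library(data.table)

DFT <- data.table(venue_id = c(1L, 1L, 2L, 2L),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L),
                  key = "DFT_id")

# Same values as TK$venue_id once TK is keyed by DFT_id
venue_id <- c(1L, 2L, 1L, 3L, 2L, 3L)

# `&` is applied element-wise to the columns of DFT (recycling the
# logical vector); this is not a join
m <- venue_id %in% 1:2 & DFT

is.matrix(m)   # the combined expression is a matrix, which `i` rejects
```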

My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK

TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        2      1   8828
# 3:        1      2   7883
# 4:        3      2   9401
# 5:        2      3   4089
# 6:        3      3    456
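
As far as I understand it, `TK[venue_id %in% 1:2]` returns a new data.table holding copies of the selected rows, so the chained `:=` modifies that copy by reference and never reaches `TK`. A minimal sketch, with the values hard-coded from the tables above; the name `sub` is only for illustration, and `on =` is spelled out so the example does not rely on keys:

```r
library(data.table)

TK <- data.table(venue_id = c(1L, 2L, 1L, 3L, 2L, 3L),
                 DFT_id   = c(1L, 1L, 2L, 2L, 3L, 3L),
                 New_id   = c(2876L, 8828L, 7883L, 9401L, 4089L, 456L))
DFT <- data.table(venue_id = c(1L, 1L, 2L, 2L),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L))

sub <- TK[venue_id %in% 1:2]                  # new table, not a view of TK
sub[DFT, New_id := i.New_id, on = "DFT_id"]   # updates `sub` by reference

sub$New_id   # 3 3 4 2: the join worked on the copy
TK$New_id    # 2876 8828 7883 9401 4089 456: TK is untouched
```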

To be clear, I'm well aware that I can split TK into two tables, perform the join, and then rbind them again, but I'm doing many different conditional joins like this, and I'm looking for a solution that is both fast and memory efficient.

This also means that I am not looking for a dplyr solution, as I'm trying to use both the binary join and the update-by-reference features, which only exist in the data.table package IIRC.


For additional information see these vignettes:

  • Update by reference
  • Binary joins
David Arenburg asked Apr 15 '15 19:04



2 Answers

Copying from Arun's updated answer:

TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

His answer gives the details of what is going on.
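
Roughly: `i` restricts the update to the rows with `venue_id %in% 1:2`, `.SD` is that subset of `TK`, and the inner join looks up the matching `New_id` in `DFT` for each of those rows; the resulting vector is then assigned by reference to the same rows. The sketch below is a variation, not Arun's exact code: it passes `on = "DFT_id"` and uses the `x.` prefix so that it does not depend on either table being keyed, and the data are hard-coded from the question.

```r
library(data.table)

TK <- data.table(venue_id = c(1L, 2L, 1L, 3L, 2L, 3L),
                 DFT_id   = c(1L, 1L, 2L, 2L, 3L, 3L),
                 New_id   = c(2876L, 8828L, 7883L, 9401L, 4089L, 456L))
DFT <- data.table(venue_id = c(1L, 1L, 2L, 2L),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L))

# Outer i:  select the rows of TK to update
# Inner:    for each selected row (.SD), fetch DFT's New_id (x.New_id)
#           by matching on DFT_id
# Outer j:  assign the fetched values to the selected rows by reference
TK[venue_id %in% 1:2,
   New_id := DFT[.SD, x.New_id, on = "DFT_id"]]

TK$New_id   # 3 3 4 9401 2 456
```

Rows with `venue_id == 3` are never touched, so no temporary copy of the whole table is needed.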

DaveTurek answered Oct 13 '22 22:10



Here's a very simple approach:

TK[DFT, New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id)][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I haven't checked, but I suspect the other answer is faster.
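
To check this on your own data, one option is to run both approaches on copies and confirm they agree before timing them. This is only a sketch: the data are hard-coded from the question, the names `a` and `b` are illustrative, `on = "DFT_id"` is added so that no keys are required, and `fifelse()` assumes data.table 1.12.4 or later.

```r
library(data.table)

TK <- data.table(venue_id = c(1L, 2L, 1L, 3L, 2L, 3L),
                 DFT_id   = c(1L, 1L, 2L, 2L, 3L, 3L),
                 New_id   = c(2876L, 8828L, 7883L, 9401L, 4089L, 456L))
DFT <- data.table(venue_id = c(1L, 1L, 2L, 2L),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L))

a <- copy(TK)   # copy() so both approaches start from identical data
b <- copy(TK)

# Approach 1: subset in i, look up the new values via .SD
a[venue_id %in% 1:2, New_id := DFT[.SD, x.New_id, on = "DFT_id"]]

# Approach 2: join every row, keep the old value where the condition fails
b[DFT, New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id), on = "DFT_id"]

identical(a$New_id, b$New_id)
```

Wrapping each update in `system.time()` on tables of realistic size would then show which one is faster in practice.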

eddi answered Oct 13 '22 22:10
