Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Join on multiple keys and conditional update [duplicate]

Tags:

join

r

data.table

So here is my real life problem which I feel like can be easily solved and I'm missing something obvious here. I have two big data sets called TK and DFT

library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2), 
                  DFT_id = rep(1:3, 2), 
                  New_id = sample(1e4, 6),
                  key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        1      2   7883
# 3:        2      3   4089
# 4:        2      1   8828
# 5:        3      2   9401
# 6:        3      3    456

(DFT <- data.table(venue_id = rep(1:2, each = 2), 
                   DFT_id = 1:4, 
                   New_id = sample(4),
                   key = "DFT_id"))

#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        1      2      4
# 3:        2      3      2
# 4:        2      4      1

I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be

TK
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I was thinking to combine both conditions, but it didn't work (still not sure why)

TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) : 
#   i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14). 
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.

My next idea was to use chaining which partially achieves the goal by joining correctly but on some temporary table without actually affecting TK

TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
#    venue_id DFT_id New_id
# 1:        1      1   2876
# 2:        2      1   8828
# 3:        1      2   7883
# 4:        3      2   9401
# 5:        2      3   4089
# 6:        3      3    456

So to make clear, I'm well aware that I can split TK into two tables, perform the join and then rbind again, but I'm doing many different conditional joins like this and I'm also looking for both speed and memory efficient solutions.

This also means that I am not looking for a dplyr solution as I'm trying to use both binary join and the update by reference features which only exist in the data.table package IIRC.


For additional information see these vignettes:

  • Update by reference
  • Binary joins
like image 276
David Arenburg Avatar asked Apr 15 '15 19:04

David Arenburg


People also ask

How does on duplicate key update work?

ON DUPLICATE KEY UPDATE is a MariaDB/MySQL extension to the INSERT statement that, if it finds a duplicate unique or primary key, will instead perform an UPDATE. The row/s affected value is reported as 1 if a row is inserted, and 2 if a row is updated, unless the API's CLIENT_FOUND_ROWS flag is set.

How do I stop inserting duplicate records in MySQL?

Note − Use the INSERT IGNORE command rather than the INSERT command. If a record doesn't duplicate an existing record, then MySQL inserts it as usual. If the record is a duplicate, then the IGNORE keyword tells MySQL to discard it silently without generating an error.

How do I Upsert in MySQL?

We can perform MySQL UPSERT operation mainly in three ways, which are as follows: UPSERT using INSERT IGNORE. UPSERT using REPLACE. UPSERT using INSERT ON DUPLICATE KEY UPDATE.


2 Answers

Copying from Arun's updated answer here

TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

His answer gives the details of what is going on.

like image 192
DaveTurek Avatar answered Nov 06 '22 21:11

DaveTurek


Here's a very simple approach:

TK[DFT, New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id)][]
#    venue_id DFT_id New_id
# 1:        1      1      3
# 2:        2      1      3
# 3:        1      2      4
# 4:        3      2   9401
# 5:        2      3      2
# 6:        3      3    456

I haven't checked, but I suspect the other answer is faster.

like image 41
eddi Avatar answered Nov 06 '22 23:11

eddi