So here is my real-life problem, which I feel can be solved easily; I'm probably missing something obvious. I have two big data sets called TK and DFT:
library(data.table)
set.seed(123)
(TK <- data.table(venue_id = rep(1:3, each = 2),
DFT_id = rep(1:3, 2),
New_id = sample(1e4, 6),
key = "DFT_id"))
# venue_id DFT_id New_id
# 1: 1 1 2876
# 2: 2 1 8828
# 3: 1 2 7883
# 4: 3 2 9401
# 5: 2 3 4089
# 6: 3 3 456
(DFT <- data.table(venue_id = rep(1:2, each = 2),
DFT_id = 1:4,
New_id = sample(4),
key = "DFT_id"))
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 1 2 4
# 3: 2 3 2
# 4: 2 4 1
I want to perform a binary left join to TK on the DFT_id column when venue_id %in% 1:2, while updating New_id by reference. In other words, the desired result would be:
TK
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 2 1 3
# 3: 1 2 4
# 4: 3 2 9401
# 5: 2 3 2
# 6: 3 3 456
My first thought was to combine both conditions, but that didn't work (I'm still not sure why):
TK[venue_id %in% 1:2 & DFT, New_id := i.New_id][]
# Error in `[.data.table`(TK, DFT & venue_id %in% 1:2, `:=`(New_id, i.New_id)) :
# i is invalid type (matrix). Perhaps in future a 2 column matrix could return a list of elements of DT (in the spirit of A[B] in FAQ 2.14).
# Please let datatable-help know if you'd like this, or add your comments to FR #1611.
My next idea was to use chaining, which only partially achieves the goal: the join is correct, but it is performed on a temporary table and does not actually affect TK:
TK[venue_id %in% 1:2][DFT, New_id := i.New_id][]
TK
# venue_id DFT_id New_id
# 1: 1 1 2876
# 2: 2 1 8828
# 3: 1 2 7883
# 4: 3 2 9401
# 5: 2 3 4089
# 6: 3 3 456
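The chained call leaves TK untouched because the first bracket, TK[venue_id %in% 1:2], returns a new data.table, and the := in the second bracket updates that new table. A minimal sketch of this, assuming the sample values are typed in from the printed tables instead of drawn with sample(); the name tmp and the explicit on = argument are additions for illustration only:

```r
library(data.table)

# Values copied from the printed tables, so no random numbers are involved.
TK <- data.table(venue_id = rep(1:3, each = 2),
                 DFT_id   = rep(1:3, 2),
                 New_id   = c(2876L, 7883L, 4089L, 8828L, 9401L, 456L),
                 key = "DFT_id")
DFT <- data.table(venue_id = rep(1:2, each = 2),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L),
                  key = "DFT_id")

# The first bracket materialises the filtered rows as a separate table
# (tmp is a hypothetical name for that intermediate result).
tmp <- TK[venue_id %in% 1:2]

# The update join modifies tmp by reference, not TK.
tmp[DFT, on = "DFT_id", New_id := i.New_id]

tmp$New_id  # the joined values
TK$New_id   # still the original values
```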
To be clear, I'm well aware that I can split TK into two tables, perform the join and then rbind them again, but I'm doing many different conditional joins like this, and I'm looking for a solution that is both fast and memory efficient.
This also means that I am not looking for a dplyr solution, as I'm trying to use both the binary join and the update-by-reference features, which only exist in the data.table package IIRC.
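For reference, the split/join/rbind route mentioned above might look roughly like the following. It is only a sketch of the baseline being avoided: the names hit, rest and combined are hypothetical, the sample values are typed in from the printed tables, and on = is spelled out explicitly.

```r
library(data.table)

TK <- data.table(venue_id = rep(1:3, each = 2),
                 DFT_id   = rep(1:3, 2),
                 New_id   = c(2876L, 7883L, 4089L, 8828L, 9401L, 456L),
                 key = "DFT_id")
DFT <- data.table(venue_id = rep(1:2, each = 2),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L),
                  key = "DFT_id")

# Split: both pieces are new tables, so TK's rows exist twice in memory.
hit  <- TK[venue_id %in% 1:2]
rest <- TK[!venue_id %in% 1:2]

# Join only the piece that satisfies the condition.
hit[DFT, on = "DFT_id", New_id := i.New_id]

# Recombine (a third allocation) and restore the key.
combined <- rbind(hit, rest)
setkey(combined, DFT_id)
combined
```

Every conditional join done this way allocates the two pieces plus the recombined table, which is the cost the question is trying to avoid.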
For additional information see these vignettes:
Copying from Arun's updated answer here
TK[venue_id %in% 1:2, New_id := DFT[.SD, New_id]][]
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 2 1 3
# 3: 1 2 4
# 4: 3 2 9401
# 5: 2 3 2
# 6: 3 3 456
His answer gives the details of what is going on.
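As a rough illustration of how that one-liner can be read, the sketch below unpacks it into steps. It is an interpretation, not a quote from the linked answer: the names sub and looked_up are hypothetical, the sample values are typed in from the printed tables, and on = "DFT_id" plus the x. prefix are added here so the join column and the source of New_id are explicit.

```r
library(data.table)

TK <- data.table(venue_id = rep(1:3, each = 2),
                 DFT_id   = rep(1:3, 2),
                 New_id   = c(2876L, 7883L, 4089L, 8828L, 9401L, 456L),
                 key = "DFT_id")
DFT <- data.table(venue_id = rep(1:2, each = 2),
                  DFT_id   = 1:4,
                  New_id   = c(3L, 4L, 2L, 1L),
                  key = "DFT_id")

# Step 1: the rows selected by i. Inside j, .SD holds these rows.
sub <- TK[venue_id %in% 1:2]

# Step 2: look each of those rows up in DFT and return DFT's New_id
# (x.New_id), one value per row of sub, in sub's row order.
looked_up <- DFT[sub, on = "DFT_id", x.New_id]

# Step 3: the same lookup done inside j, so that := writes the values
# straight into the selected rows of TK by reference.
TK[venue_id %in% 1:2, New_id := DFT[.SD, on = "DFT_id", x.New_id]]
TK
```

Because the filter sits in i of TK itself and the join happens inside j, the assignment targets TK and not an intermediate table.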
Here's a very simple approach:
TK[DFT, New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id)][]
# venue_id DFT_id New_id
# 1: 1 1 3
# 2: 2 1 3
# 3: 1 2 4
# 4: 3 2 9401
# 5: 2 3 2
# 6: 3 3 456
I haven't checked, but I suspect the other answer is faster.
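One way to check this is to time both approaches on larger synthetic data. The sketch below is not a rigorous benchmark: the table size, the seed and the names big_TK, big_DFT, a and b are arbitrary choices made for illustration, on = is spelled out explicitly, and the timings will vary by machine and data.table version.

```r
library(data.table)

set.seed(42)
n <- 1e5L  # arbitrary size chosen for this sketch

big_DFT <- data.table(DFT_id = seq_len(n),
                      New_id = sample.int(n),
                      key = "DFT_id")
big_TK <- data.table(venue_id = sample(1:3, n, replace = TRUE),
                     DFT_id   = sample.int(n, n, replace = TRUE),
                     New_id   = -seq_len(n),  # negative marks "not updated"
                     key = "DFT_id")

# Each approach gets its own copy, since both update by reference.
a <- copy(big_TK)
b <- copy(big_TK)

# Approach 1: filter in i, join inside j.
t_subset <- system.time(
  a[venue_id %in% 1:2, New_id := big_DFT[.SD, on = "DFT_id", x.New_id]]
)

# Approach 2: join everything, then choose per row with fifelse().
t_fifelse <- system.time(
  b[big_DFT, on = "DFT_id",
    New_id := fifelse(venue_id %in% 1:2, i.New_id, New_id)]
)

# Both approaches should leave the tables in the same state.
stopifnot(identical(a$New_id, b$New_id))

c(subset_then_join  = t_subset[["elapsed"]],
  join_then_fifelse = t_fifelse[["elapsed"]])
```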