Update join with multiple rows

Tags: r, data.table

Question

When doing an update-join, where the i table has multiple rows per key, how can you control which row is returned?

Example

In this example, the update-join returns the last row from dt2:

library(data.table) 

dt1 <- data.table(id = 1) 
dt2 <- data.table(id = 1, letter = letters) 

dt1[ 
    dt2 
    , on = "id" 
    , letter := i.letter 
    ] 

dt1
#    id letter
# 1:  1      z

How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?


References

A couple of similar references by user @Frank:

  • data.table tutorial - in particular the 'warning' on update-joins
  • Issue on github

SymbolixAU asked Mar 15 '18 22:03


2 Answers

The most flexible idea I can think of is to only join the part of dt2 which contains the rows you want. So, for the second row:

dt1[ 
    dt2[, .SD[2], by=id]
    , on = "id" 
    , letter := i.letter
    ]

dt1
#   id letter
#1:  1      b

With a hat-tip to @Frank for simplifying the sub-select of dt2.
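
As a rough sketch of the same idea for other positions (not part of the original answer): the sub-select can take the position from a variable, and unique() with by= is one way to get the first row per id. The names n, first_rows and nth_rows are made up for illustration.

```r
library(data.table)

dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)

# First row per id: unique() keeps the first occurrence of each key
first_rows <- unique(dt2, by = "id")

# nth row per id; a group with fewer than n rows yields NA for that id
n <- 5L
nth_rows <- dt2[, .SD[n], by = id]

# Each id now appears once in i, so the update-join is unambiguous
dt1[nth_rows, on = "id", letter := i.letter]
dt1
```

Because the reduced table has one row per id, the result no longer depends on the row order of dt2 at join time.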

thelatemail answered Sep 28 '22 07:09


How can I control it to return the 1st, 2nd, nth row, rather than defaulting to the last?

Not elegant, but sort-of works:

n = 3L
dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]

A couple of problems:

  1. It isn't optimized by GForce, unlike a plain grouped selection such as this one:

    > dt2[, letter[3], by=id, verbose=TRUE]
    Detected that j uses these columns: letter 
    Finding groups using forderv ... 0.020sec 
    Finding group sizes from the positions (can be avoided to save RAM) ... 0.000sec 
    lapply optimization is on, j unchanged as 'letter[3]'
    GForce optimized j to '`g[`(letter, 3)'
    Making each group and running j (GForce TRUE) ... 0.000sec 
       id V1
    1:  1  c
    
  2. If n is outside of 1:.N for some joined groups, no warning will be given:

    n = 40L
    dt1[, v := dt2[.SD, on=.(id), x.letter[n], by=.EACHI]$V1]
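
One possible way to surface the second problem, sketched here under the assumption that a simple group-size check is acceptable, is to count the rows per id before selecting. The names sizes and too_small are hypothetical.

```r
library(data.table)

dt1 <- data.table(id = 1)
dt2 <- data.table(id = 1, letter = letters)

n <- 40L

# Count rows per id and find the groups too small to have an nth row
sizes <- dt2[, .N, by = id]
too_small <- sizes[N < n]

if (nrow(too_small) > 0L) {
  message("n exceeds the group size for ", nrow(too_small), " id(s); those rows get NA")
}

# Same selection as above; v is NA wherever the group has fewer than n rows
dt1[, v := dt2[.SD, on = .(id), x.letter[n], by = .EACHI]$V1]
dt1
```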
    

Alternatively, make a habit of checking that i in an update join x[i] is "keyed" by the join columns, meaning it has at most one row per combination of them:

cols = "id"
stopifnot(nrow(dt2) == uniqueN(dt2, by=cols))
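
When that check fails, it can help to see which keys are repeated. A minimal sketch, assuming the same dt2 and cols as above (dups is a made-up name):

```r
library(data.table)

dt2 <- data.table(id = 1, letter = letters)
cols <- "id"

# Count rows per combination of the join columns and keep the repeated ones
dups <- dt2[, .N, by = cols][N > 1L]
dups
```

Here dups has one row, since id 1 occurs once for each of the 26 letters.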

And then, if appropriate, build a different i table with one row per key to join on:

mDT = dt2[, .(letter = letter[3L]), by=id]
dt1[mDT, on=cols, v := i.letter]

Frank answered Sep 28 '22 06:09