Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do an X[Y] data.table join, without losing an existing main key on X?

Tags:

r

data.table

When joining data.tables X and Y using X[Y], X must have a key, which the key of Y uses to make the join. If X is a very big table and normally keyed on columns not used in the join, then the key of X needs to be changed for the join, and then restored back to the original key after the join. Is there an efficient way to do the join, without losing the original main key on X?

I have a large time-series environmental dataset DT (1M rows, 36 columns), as a data.table with key on the site and date columns. I need to do calculations on existing columns in DT and/or insert a new column, based an existing column, using a small lookup or recode table.

Here is a minimal example:

require(data.table)   # using v1.9.5

# main data table DT, keyed on site and date, with data column x
DT <- data.table(site = rep(LETTERS[1:2], each=3),
                 date = rep(1:3, times=2),
                 x = rep(1:3*10, times=2),
                 key = "site,date")
DT
#    site date  x
# 1:    A    1 10
# 2:    A    2 20
# 3:    A    3 30
# 4:    B    1 10
# 5:    B    2 20
# 6:    B    3 30

# lookup table for x to y lookup, keyed on x
x2y <- data.table(x = c(10,20), y = c(100,200), key = "x")
x2y
#     x   y
# 1: 10 100
# 2: 20 200

To join the lookup table x2y with the main table DT, I set the DT key to "x":

setkey(DT,x)

Then the join works as expected.

DT[x2y]
#    site date  x   y
# 1:    A    1 10 100
# 2:    B    1 10 100
# 3:    A    2 20 200
# 4:    B    2 20 200

and I can use "y" from the lookup table in calculations or to create a new column in DT.

DT[x2y, y:=y]
#    site date  x   y
# 1:    A    1 10 100
# 2:    B    1 10 100
# 3:    A    2 20 200
# 4:    B    2 20 200
# 5:    A    3 30  NA
# 6:    B    3 30  NA

But now my time-series dataset DT is keyed on "x", and I need to set the key back to "site,date" for further use.

setkey(DT,site,date)

Is this approach (key X, join, then re-key X) the fastest way to do this when DT is very large (1M rows), or is there an equally efficient way to do this type of lookup join, without losing the original key on the big DT table?

like image 449
Douglas Clark Avatar asked Apr 28 '15 11:04

Douglas Clark


2 Answers

Update: Thanks to a bug fix, this is no longer necessary. Please see the accepted answer.


I would join just x:

DT[,y:=x2y[J(DT$x)]$y]

The key of DT is retained here.

like image 179
Frank Avatar answered Sep 19 '22 22:09

Frank


With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=:

# join
DT[x2y, on="x"] # key is removed as row order gets changed.

# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
like image 35
Arun Avatar answered Sep 21 '22 22:09

Arun