When joining data.tables X and Y using X[Y], X must have a key, which the key of Y uses to make the join. If X is a very big table and normally keyed on columns not used in the join, then the key of X needs to be changed for the join, and then restored back to the original key after the join. Is there an efficient way to do the join, without losing the original main key on X?
I have a large time-series environmental dataset DT (1M rows, 36 columns), as a data.table with key on the site and date columns. I need to do calculations on existing columns in DT and/or insert a new column, based an existing column, using a small lookup or recode table.
Here is a minimal example:
require(data.table) # using v1.9.5
# main data table DT, keyed on site and date, with data column x
DT <- data.table(site = rep(LETTERS[1:2], each=3),
date = rep(1:3, times=2),
x = rep(1:3*10, times=2),
key = "site,date")
DT
# site date x
# 1: A 1 10
# 2: A 2 20
# 3: A 3 30
# 4: B 1 10
# 5: B 2 20
# 6: B 3 30
# lookup table for x to y lookup, keyed on x
x2y <- data.table(x = c(10,20), y = c(100,200), key = "x")
x2y
# x y
# 1: 10 100
# 2: 20 200
To join the lookup table x2y with the main table DT, I set the DT key to "x":
setkey(DT,x)
Then the join works as expected.
DT[x2y]
# site date x y
# 1: A 1 10 100
# 2: B 1 10 100
# 3: A 2 20 200
# 4: B 2 20 200
and I can use "y" from the lookup table in calculations or to create a new column in DT.
DT[x2y, y:=y]
# site date x y
# 1: A 1 10 100
# 2: B 1 10 100
# 3: A 2 20 200
# 4: B 2 20 200
# 5: A 3 30 NA
# 6: B 3 30 NA
But now my time-series dataset DT is keyed on "x", and I need to set the key back to "site,date" for further use.
setkey(DT,site,date)
Is this approach (key X, join, then re-key X) the fastest way to do this when DT is very large (1M rows), or is there an equally efficient way to do this type of lookup join, without losing the original key on the big DT table?
Update: Thanks to a bug fix, this is no longer necessary. Please see the accepted answer.
I would join just x
:
DT[,y:=x2y[J(DT$x)]$y]
The key of DT
is retained here.
With secondary keys implemented (since v1.9.6) and the recent bug fix on retaining/discarding keys properly (in v1.9.7), you can now do this using on=
:
# join
DT[x2y, on="x"] # key is removed as row order gets changed.
# update using joins
DT[x2y, y:=y, on="x"] # key is retained, as row order isn't changed.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With