I have two data.tables, X (3m rows by ~500 columns) and Y (100 rows by two columns).
library(data.table)
set.seed(1)
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
Y <- data.table( z=runif(6), g=1:6, key="g" )
I want to do a left outer join on X, which I can do with Y[X], thanks to:
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
But I want to add the new column to X without copying X (since it's huge).
Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.
X[ , z:= Y[X,z]$z ]
works, but is kludgy and doesn't scale well to more than one column.
How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?
This is easy to do:
X[Y, z := i.z]
It works because the only difference between Y[X] and X[Y] here is when some elements are not in Y, in which case presumably you'd want z to be NA, which is exactly what the above assignment does.
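To make the "no copy" claim concrete, here is a minimal sketch on toy data. The small X and Y below are hypothetical stand-ins for the tables in the question, and comparing address() before and after is just one way to check that X was modified by reference.

```r
library(data.table)

# Hypothetical toy tables, both keyed on g as in the question.
X <- data.table(a = c("p", "q", "r", "s"), g = c(1L, 2L, 7L, 2L), key = "g")
Y <- data.table(z = c(0.1, 0.2, 0.3), g = 1:3, key = "g")

before <- address(X)       # memory address of X before the join
X[Y, z := i.z]             # update join: adds z to X by reference
after <- address(X)        # memory address of X after the join

identical(before, after)   # TRUE if X was updated in place, not copied
X[g == 7L, z]              # g = 7 has no match in Y, so z is NA there
```

Rows of Y whose key does not occur in X (g = 3 here) are simply not assigned anywhere, so X keeps its original number of rows.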
It would also work just as well for many variables:
X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
Since you require the operation Y[X], you can add the argument nomatch=0 (as @mnel points out) so as not to get NAs for rows where X doesn't contain the key values from Y. That is:
X[Y, z := i.z, nomatch=0]
From the NEWS for data.table:
CHANGES IN DATA.TABLE VERSION 1.7.10
NEW FEATURES
o The prefix i. can now be used in j to refer to join inherited columns of i that are otherwise masked by columns in x with the same name.
As an addition to the answer above, you can also do (v1.9.6+):
require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames))]
where colNames is a character vector listing the columns you want from Y. This lets you efficiently select columns to add (define colNames from a subset of names(Y)) in the case you are adding many columns.
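As a rough illustration of the colNames idiom, the sketch below uses hypothetical toy tables with two value columns, z1 and z2, and builds colNames from every non-key column of Y. It assumes data.table v1.9.6 or later, as noted above.

```r
library(data.table)  # assumes v1.9.6+

# Hypothetical toy tables, both keyed on g.
X <- data.table(a = c("p", "q", "r"), g = c(1L, 2L, 7L), key = "g")
Y <- data.table(g = 1:2, z1 = c(10, 20), z2 = c("u", "v"), key = "g")

# Take every non-key column of Y as a column to add to X.
colNames <- setdiff(names(Y), key(Y))

# mget() looks up i.z1 and i.z2; the left-hand side names the new columns.
X[Y, (colNames) := mget(paste0("i.", colNames))]
```

The parentheses around colNames on the left-hand side matter: without them, := would create a single column literally named colNames.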
Also, you can combine it with the new on= argument (from v1.9.6+) as:
# ad-hoc joins using 'on=' instead of setting keys
require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames)), on = "g"]
Credit to akrun for the (colNames) := mget(colNames) strategy here: Update rows of data frame in R.