data.table join then add columns to existing data.frame without re-copy

I have two data.tables, X (3m rows by ~500 columns), and Y (100 rows by two columns).

library(data.table)
set.seed(1)
X <- data.table( a=letters, b=letters, c=letters, g=sample(c(1:5,7),length(letters),replace=TRUE), key="g" )
Y <- data.table( z=runif(6), g=1:6, key="g" )

I want to do a left outer join on X, which I can do by Y[X] thanks to:

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

But I want to add the new column to X without copying X (since it's huge).

Obviously, something like X <- Y[X] works, but unless data.table is far cleverer than I give it credit for (and I give it credit for quite a lot of deviousness!), I believe this copies the whole of X.

X[ , z:= Y[X,z]$z ] works, but is kludgy and doesn't scale well to more than one column.

How do I store the results of a merge back into the retained data.table in an efficient (both in terms of copies and in terms of programmer time) way?

Ari B. Friedman asked Oct 23 '13 21:10


2 Answers

This is easy to do:

X[Y, z := i.z]

It works because the only difference between Y[X] and X[Y] here arises when some key values in X have no match in Y. In that case you would presumably want z to be NA for those rows, which is exactly what the assignment above does.
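To check both claims (no copy of X, and NA for unmatched rows), here is a minimal sketch. The toy tables are deterministic stand-ins for the ones in the question, and comparing address() before and after is just one assumed way to confirm that X was modified in place.

```r
library(data.table)

# Toy stand-ins for the question's tables (hypothetical data, chosen to be deterministic)
X <- data.table(a = letters,
                g = rep(c(1:5, 7L), length.out = length(letters)),
                key = "g")
Y <- data.table(z = seq(0.1, 0.6, by = 0.1), g = 1:6, key = "g")

addr_before <- address(X)   # memory address of X before the update

X[Y, z := i.z]              # update join: adds column z to X by reference

same_object   <- identical(addr_before, address(X))  # TRUE if X was not copied
unmatched_na  <- X[g == 7, all(is.na(z))]            # g = 7 is absent from Y, so z is NA there
matched_anyna <- X[g != 7, anyNA(z)]                 # matched rows all received a value
```

Note that g = 6 exists only in Y; it simply has nothing to update in X, so X keeps its original 26 rows.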

It would also work just as well for many variables:

X[Y, `:=`(z1 = i.z1, z2 = i.z2, ...)]
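A runnable version of the multi-column form might look like the sketch below. The columns z1 and z2 and the toy data are made up for illustration; only two columns are shown, and the elided ones in the line above would follow the same pattern.

```r
library(data.table)

# Hypothetical small tables; z1 and z2 are illustrative column names
X <- data.table(a = letters[1:6], g = c(1L, 2L, 2L, 3L, 5L, 7L), key = "g")
Y <- data.table(g = 1:6, z1 = (1:6) * 10, z2 = LETTERS[1:6], key = "g")

# Add both columns from Y to X in a single update join
X[Y, `:=`(z1 = i.z1, z2 = i.z2)]

print(X)
```

Both rows with g = 2 pick up the same values from Y, and the g = 7 row gets NA in both new columns.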

Since the operation you want is Y[X], you can also add the argument nomatch=0 (as @mnel points out), so that rows of Y whose key values do not occur in X are dropped from the join instead of being matched to NA. That is:

X[Y, z := i.z, nomatch=0]

From the NEWS for data.table

    **********************************************
    **                                          **
    **   CHANGES IN DATA.TABLE VERSION 1.7.10   **
    **                                          **
    **********************************************

NEW FEATURES

o   The prefix i. can now be used in j to refer to join inherited
    columns of i that are otherwise masked by columns in x with
    the same name.
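To see what the i. prefix buys you, here is a small sketch in which X already has its own z column, so a bare z in j would refer to X's column rather than Y's. The tables are hypothetical.

```r
library(data.table)

# X already carries a (stale) z column that Y's z should overwrite
X <- data.table(g = c(1L, 2L, 3L), z = c(0, 0, 0), key = "g")
Y <- data.table(g = c(1L, 2L), z = c(0.5, 0.7), key = "g")

# i.z is Y's z; without the prefix, z would resolve to X's own column
X[Y, z := i.z]

print(X$z)
```

Because z already existed in X, the unmatched row (g = 3) keeps its old value instead of becoming NA; only the matched rows are touched.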
eddi answered Nov 09 '22 18:11


As an addition to the answer above, you can also do (v1.9.6+):

require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames))]

where colNames is a character vector naming the columns you want from Y. When you are adding many columns, this lets you choose them programmatically, for example by defining colNames as a subset of names(Y).
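For instance, to bring over every non-key column of Y, something like the following sketch should do. The tables and the z1, z2, z3 columns are invented for illustration, and taking setdiff(names(Y), key(Y)) is just one assumed way to build colNames.

```r
library(data.table)  # v1.9.6+

# Hypothetical tables; Y has three payload columns besides the key
X <- data.table(a = letters[1:4], g = c(1L, 2L, 3L, 7L), key = "g")
Y <- data.table(g = 1:3,
                z1 = c(0.1, 0.2, 0.3),
                z2 = c("p", "q", "r"),
                z3 = c(TRUE, FALSE, TRUE),
                key = "g")

# Every column of Y except its key
colNames <- setdiff(names(Y), key(Y))

# mget() looks up i.z1, i.z2, i.z3 and returns them as a list for :=
X[Y, (colNames) := mget(paste0("i.", colNames))]

print(names(X))
```

The parentheses around colNames on the left-hand side matter: they tell data.table to use the contents of the variable rather than create a column literally called colNames.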

Also, you can combine it with the new on= argument (from v1.9.6+) as:

# ad-hoc joins using 'on=' instead of setting keys
require(data.table) # v1.9.6+
X[Y, (colNames) := mget(paste0("i.", colNames)), on = "g"]
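A self-contained sketch of the on= variant, using hypothetical unkeyed tables, shows that neither table needs a key and that X keeps its original row order:

```r
library(data.table)  # v1.9.6+

# Neither table is keyed, and X is deliberately not sorted by g
X <- data.table(a = letters[1:4], g = c(3L, 1L, 7L, 2L))
Y <- data.table(g = 1:3, z = c(0.1, 0.2, 0.3))

colNames <- "z"

# Ad-hoc update join on g; no setkey() call required
X[Y, (colNames) := mget(paste0("i.", colNames)), on = "g"]

print(X)
```

Since setkey() is never called, X is not reordered as a side effect, which can matter when the existing row order is meaningful.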

Credit to akrun for the (colNames) := mget(colNames) strategy here: Update rows of data frame in R.

Alexander Li answered Nov 09 '22 19:11