
R data.table join: SQL "select *" alike syntax in joined tables?

Tags: join, r, data.table

I have two data.tables with many fields.

I want to join the two tables, add some calculated fields and append all other fields from the first, second or both tables (similar to SQL's select a+b AS sum, DT1.*, DT2.* FROM...) without typing all the field names.

How can I do this with the simplest syntax and the best performance?

Simplified example data:

library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

Now I want to join the tables and add a calculated field:

DT1[DT2, .(sum=a + b, <<< how to say DT1.*, DT2.* here? >>> ), on="x"]

Update May 4, 2016: Inspired by user jangorecki I have found a feature request for this:

Should be able to refer to i's .SD during a join

R Yoda asked May 03 '16 14:05


2 Answers

This should answer your need precisely.
It uses a very powerful R feature called computing on the language (or metaprogramming), which is well described in the official R Language Definition manual. This is an exceptional feature of the R language and should not be forgotten, IMO.

library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

jj = as.call(c(
    list(as.name(".")),
    list(sum = quote(a+b)),
    lapply(unique(c(names(DT1), names(DT2))), as.name)
))
print(jj)
#.(sum = a + b, x, a, b)
DT1[DT2, eval(jj), on="x"]
#   sum x  a b
#1:  NA d NA 6
#2:   8 c  1 7
#3:  11 b  3 8
#4:  13 b  5 8
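
To avoid rebuilding the call by hand for every join, the same construction can be wrapped in a small function. This is only a sketch: build_select_all is a hypothetical helper name, not part of data.table, and it assumes the only column name the tables share is the join key (other shared names would be ambiguous in the joined result).

```r
library(data.table)

# Hypothetical helper (not a data.table function): builds the call
# .(<computed fields>, <every column name found in the given tables>)
build_select_all = function(computed, ...) {
  cols = unique(unlist(lapply(list(...), names)))
  as.call(c(list(as.name(".")), computed, lapply(cols, as.name)))
}

DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

# computed fields are passed as a named list of quoted expressions
jj = build_select_all(list(sum = quote(a + b)), DT1, DT2)
print(jj)

res = DT1[DT2, eval(jj), on="x"]
print(res)
```

The call is assigned to jj first and then passed as eval(jj), exactly as in the code above, so the join itself is unchanged.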
jangorecki answered Oct 17 '22 00:10


I'm more certain of my answer to the second part of your question, so I'll answer that first. If you only want the columns of one table (DT1.* here) plus the additional column new = a+b, I would do it this way:

DT1[DT2,new:=a+b,on="x"]
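
A minimal sketch of what this update join does, using the example data from the question. Note that := modifies DT1 by reference, so the sketch prints DT1 afterwards instead of capturing a return value; the columns of DT2 are not added.

```r
library(data.table)

DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

# update join: adds column "new" to DT1 in place;
# b is looked up in DT2 for the matching x
DT1[DT2, new := a + b, on="x"]

# DT1 keeps all 5 rows; rows whose x has no match in DT2 get NA in "new"
print(DT1)
```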

For the first part, where you need both DT1.* and DT2.*, the only answer I can think of is to join first and then add the column in a chained call:

DT1[DT2, on="x"][,new := a+b]
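
As a sketch of the chained form with the question's example data: the join produces a new table, so DT1 itself is left untouched. Moving the calculated column to the front with setcolorder is an optional extra (not part of the one-liner above) to mimic the column order of the SQL select.

```r
library(data.table)

DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)

# join first (all columns of both tables), then add the calculated column
res = DT1[DT2, on="x"][, new := a + b]

# optional: put the calculated column first, like "select a+b AS new, ..."
setcolorder(res, c("new", setdiff(names(res), "new")))
print(res)
```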

However, there might be more efficient code to achieve this.

shreyasgm answered Oct 17 '22 00:10