I have two data.tables with many fields.
I want to join the two tables, add some calculated fields and append all other fields from the first, second or both tables (similar to SQL's select a+b AS sum, DT1.*, DT2.* FROM...
) without typing all the field names.
How can I do this (regarding easiest syntax and best performance)?
Simplified example data:
library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)
Now I want to join the tables and add a calculated field:
DT1[DT2, .(sum=a + b, <<< how to say DT1.*, DT2.* here? >>> ), on="x"]
Update May 4, 2016: Inspired by user jangorecki I have found a feature request for this:
Should be able to refer to i's .SD during a join
category_id=category.id; The join is done by the JOIN operator. In the FROM clause, the name of the first table ( product ) is followed by a JOIN keyword then by the name of the second table ( category ). This is then followed by the keyword ON and by the condition for joining the rows from the different tables.
To join two tables based on a column match without loosing any of the data from the left table, you would use a LEFT OUTER JOIN. Left outer joins are used when you want to get all the values from one table but only the records that match the left table from the right table.
If you want to join by multiple variables, then you need to specify a vector of variable names: by = c("var1", "var2", "var3") . Here all three columns must match in both tables. If you want to use all variables that appear in both tables, then you can leave the by argument blank.
INNER JOIN. The INNER JOIN keyword selects all rows from both the tables as long as the condition is satisfied. This keyword will create the result-set by combining all rows from both the tables where the condition satisfies i.e value of the common field will be the same.
This should precisely answer your need.
It uses very powerful R feature called computing on the language (or meta programming) well described in official R Language Definition manual. This is an exceptional feature of R language and should not be forgotten IMO.
library(data.table)
DT1 = data.table(x=c("c", "a", "b", "a", "b"), a=1:5)
DT2 = data.table(x=c("d", "c", "b"), b=6:8)
jj = as.call(c(
list(as.name(".")),
list(sum = quote(a+b)),
lapply(unique(c(names(DT1), names(DT2))), as.name)
))
print(jj)
#.(sum = a + b, x, a, b)
DT1[DT2, eval(jj), on="x"]
# sum x a b
#1: NA d NA 6
#2: 8 c 1 7
#3: 11 b 3 8
#4: 13 b 5 8
I'm more certain of my answer to the second part of your question, so I'll answer that first. If you only want to say DT1.* or DT2.*, but want the additional column new = a+b, I would do it this way:
DT1[DT2,new:=a+b,on="x"]
For the first part, where you need DT1.* and DT2.*, the only answer I can think of is:
DT1[DT2, on="x"][,new := a+b]
However, there might be more efficient code to achieve this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With