Let me first start by saying that this is the first question that I am posting on stack overflow. Please let me know if I need to change the style,format etc in my questions.
I would like to do a left outer join operation on two data tables with an exra condition that I be allowed to have different names for the key variables in the two data tables. Example:
DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2a=1:6,m1=seq(10,60,by=10))
setkey(DT1,x1,x2a)
> DT1
x1 x2a m1
1: a 3 30
2: a 5 50
3: b 1 10
4: b 4 40
5: b 6 60
6: c 2 20
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)
setkey(DT2,x1,x2b)
> DT2
x1 x2b m2
1: a 9
2: a 10
3: b 1 5
4: b 6 8
5: c 7 7
6: d 4 6
############# first, I use the merge operation on the data frames to do a left outer join
dfL<-merge.data.frame(DT1,DT2,by.x=c('x1','x2a'),by.y=c('x1','x2b'),all.x=TRUE)
> dfL
x1 x2a m1 m2
1 a 3 30 NA
2 a 5 50 NA
3 b 1 10 5
4 b 4 40 NA
5 b 6 60 8
6 c 2 20 NA
################# attempt with data table left outer join
> dtL<-DT2[DT1,on=c("x1","x2a")]
Error in forderv(x, by = rightcols) :
'by' value -2147483648 out of range [1,3]
#################### code that works with data table
DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2=as.character(1:6),m1=seq(10,60,by=10))
setkey(DT1,x1,x2)
DT1
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2=c(1,4,7,6," "," ") ,m2=5:10)
setkey(DT2,x1,x2)
DT2
dtL<-DT2[DT1]
######################## this required identical naming of the key variables in the two data tables
################### Also does not allow a ad-hoc selection of the key variables with the "on" argument
I would like to know if it is possible to retain the flexibility of the merge command with data frames. With data.table.
As from v1.9.6 (on CRAN Sep 2015) :
X[Y]
syntax can now join without having to set keys by using the newon
argument. For example:DT1[DT2, on=c(x = "y")]
would join column "y" ofDT2
with "x" ofDT1
.DT1[DT2, on="y"]
would join column "y" of both data.tables.
Please search the README for the string "on=" for further items relating to on=
in v1.9.7 in development. You asked for left outer join, which is the default with X[Y]
syntax (X[Y,nomatch=0]
switches to inner join).
But note that setting a key is always going to faster to subsequently join to if you can pay the upfront cost of setkey
(which is pretty fast in data.table). The concept is similar to a clustered index in SQL.
The advantage of X[Y]
syntax vs merge()
is that you can include j
and by
inside the same [...]
query. Such a query is optimized so only the columns needed by the j
expression are joined, saving time and RAM. See FAQ 1.12. by=.EACHI
is also a powerful feature of X[Y]
syntax you don't get with merge()
and currently unique to data.table. You can use on=
together with by=.EACHI
inside a DT[...]
query.
From ?data.table::merge
This merge method for data.table behaves very similarly to that of data.frames with one major exception: By default, the columns used to merge the data.tables are the shared key columns rather than the shared columns with the same names. Set the by, or by.x, by.y arguments explicitly to override this default.
So we can use the by
arguments to override the keys
.
library(data.table)
DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2a=1:6,m1=seq(10,60,by=10))
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)
## you will get an error when joining a character to a integer:
DT2$x2b <- as.integer(DT2$x2b)
## Alternative:
## DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6,NA,NA),m2=5:10)
merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)
x1 x2a m1 m2
1: a 3 30 NA
2: a 5 50 NA
3: b 1 10 5
4: b 4 40 NA
5: b 6 60 8
6: c 2 20 NA
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With