left outer join with data.table with different names for key variables

Question

Let me first start by saying that this is the first question that I am posting on stack overflow. Please let me know if I need to change the style,format etc in my questions.

I would like to do a left outer join operation on two data tables with an exra condition that I be allowed to have different names for the key variables in the two data tables. Example:

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"),   x2a=1:6,m1=seq(10,60,by=10))
setkey(DT1,x1,x2a)
> DT1
   x1 x2a m1
1:  a   3 30
2:  a   5 50
3:  b   1 10
4:  b   4 40
5:  b   6 60
6:  c   2 20
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)
setkey(DT2,x1,x2b)
> DT2
   x1 x2b m2
1:  a      9
2:  a     10
3:  b   1  5
4:  b   6  8
5:  c   7  7
6:  d   4  6
############# first, I use the merge operation on the data frames to do a left outer join
dfL<-merge.data.frame(DT1,DT2,by.x=c('x1','x2a'),by.y=c('x1','x2b'),all.x=TRUE)
> dfL
  x1 x2a m1 m2
1  a   3 30 NA
2  a   5 50 NA
3  b   1 10  5
4  b   4 40 NA
5  b   6 60  8
6  c   2 20 NA
################# attempt with data table left outer join 
> dtL<-DT2[DT1,on=c("x1","x2a")]
Error in forderv(x, by = rightcols) : 
  'by' value -2147483648 out of range [1,3]

#################### code that works with data table
DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2=as.character(1:6),m1=seq(10,60,by=10))
setkey(DT1,x1,x2)
DT1
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2=c(1,4,7,6," "," ") ,m2=5:10)
setkey(DT2,x1,x2)
DT2
dtL<-DT2[DT1]
######################## this required identical naming of the key variables in the two data tables
################### Also does not allow a ad-hoc selection of the key variables with the "on" argument

I would like to know if it is possible to retain the flexibility of the merge command with data frames. With data.table.

Matt Dowle · Accepted Answer

As from v1.9.6 (on CRAN Sep 2015) :

X[Y] syntax can now join without having to set keys by using the new on argument. For example: DT1[DT2, on=c(x = "y")] would join column "y" of DT2 with "x" of DT1. DT1[DT2, on="y"] would join column "y" of both data.tables.

Please search the README for the string "on=" for further items relating to on= in v1.9.7 in development. You asked for left outer join, which is the default with X[Y] syntax (X[Y,nomatch=0] switches to inner join).

But note that setting a key is always going to faster to subsequently join to if you can pay the upfront cost of setkey (which is pretty fast in data.table). The concept is similar to a clustered index in SQL.

The advantage of X[Y] syntax vs merge() is that you can include j and by inside the same [...] query. Such a query is optimized so only the columns needed by the j expression are joined, saving time and RAM. See FAQ 1.12. by=.EACHI is also a powerful feature of X[Y] syntax you don't get with merge() and currently unique to data.table. You can use on= together with by=.EACHI inside a DT[...] query.

tospig · Answer

From ?data.table::merge

This merge method for data.table behaves very similarly to that of data.frames with one major exception: By default, the columns used to merge the data.tables are the shared key columns rather than the shared columns with the same names. Set the by, or by.x, by.y arguments explicitly to override this default.

So we can use the by arguments to override the keys.

library(data.table)

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"),   x2a=1:6,m1=seq(10,60,by=10))
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)

## you will get an error when joining a character to a integer:
DT2$x2b <- as.integer(DT2$x2b)
## Alternative:
## DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6,NA,NA),m2=5:10)

merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)

   x1 x2a m1 m2
1:  a   3 30 NA
2:  a   5 50 NA
3:  b   1 10  5
4:  b   4 40 NA
5:  b   6 60  8
6:  c   2 20 NA

left outer join with data.table with different names for key variables

Tags:

r

data.table

Vivek

2 Answers

Matt Dowle

tospig

Recent Activity

Donate For Us

left outer join with data.table with different names for key variables

Tags:

r

data.table

Vivek

2 Answers

Matt Dowle

tospig

Related questions

Recent Activity

Donate For Us