
Use `j` to select the join column of `x` and all its non-join columns

Tags: r, data.table

I have two data tables:

library(data.table)
d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)

d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

I perform a non-equi join where the value 'val' in 'd1' should fall within the range defined by 'from' and 'to' in 'd2' for each group 'grp'.

d1[d2, on = .(grp, val >= from, val <= to), nomatch = 0]
#    grp val y1 y2 val.1  z
# 1:   a   1  1  5     4 11
# 2:   c   1  2  6     4 13
# 3:   a   5  4  8    10 14
# 4:   b   5  3  7    10 15

In the output, the join columns 'val' and 'val.1' carry i's values (those of 'from' and 'to' in 'd2', respectively). However, I would like to have x's join column instead. Now, because...

Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's.

...this could be achieved by specifying val = x.val in j:

d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]
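For reference, here is a self-contained run of that query. It is only a sketch: the name `res` is illustrative, and the values in the comments are what the sample data above should give.

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# x.val is d1's own val; a plain `val` in j would be masked by d2's bounds
res <- d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]

res$grp  # "a" "c" "a" "b"
res$val  # 2 3 7 6
res$z    # 11 13 14 15
```

Only the three columns named in j come back, which is why y1 and y2 still have to be recovered some other way.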

To avoid typing all of x's non-join columns (possibly many) in j, my current workaround is to join the result above back to the original data, which gives the desired result:

d1[d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]
   , on = .(grp, val)]
#    grp val y1 y2  z
# 1:   a   2  1  5 11
# 2:   c   3  2  6 13
# 3:   a   7  4  8 14
# 4:   b   6  3  7 15

However, this seems a bit clumsy. Thus my question: how can I select the join column from x and all non-join columns from x in j in one go?


PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

asked Feb 19 '17 15:02 by Henrik


2 Answers

From the question: "PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns)."

The amount of post-processing is bounded by the number of on= columns:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][, 
  `:=`(val = from, from = NULL, to = NULL)][]

That doesn't seem too bad.
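If the column order of d1 matters too, one more step should restore it. The following is a sketch under the assumption that d1's columns should come first and d2's extra columns last; `res` is an illustrative name, and the chained `:=` is split into two steps only for readability.

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

res <- d2[d1, on = .(grp, from <= val, to >= val), nomatch = 0]

# after the join, from and to both hold d1's val: keep one copy, drop the bounds
res[, val := from]
res[, c("from", "to") := NULL]

# d1's columns first, in their original order, then whatever came from d2
setcolorder(res, c(names(d1), setdiff(names(res), names(d1))))

names(res)  # "grp" "val" "y1" "y2" "z"
```

The rows follow d1's order here, since d1 is in the i position.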


Following @Jaap's comment, here's another way, adding columns to d1 with an update join:

nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

This makes sense here because the desired output is essentially d1 plus some columns from d2 (since each row of d1 matches at most one row of d2).
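Note that an update join modifies d1 by reference. If the original table has to stay untouched, the same idea can be applied to a copy. This is a sketch, with `d1u` as an illustrative name and the single column z written out explicitly instead of going through mget:

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# work on a copy so that d1 itself keeps its four columns
d1u <- copy(d1)

# for every row of d2, assign its z to the matching rows of d1u
d1u[d2, on = .(grp, val >= from, val <= to), z := i.z]

d1u$z      # 11 13 15 14, in d1's row order
names(d1)  # "grp" "val" "y1" "y2"
```

Unlike the nomatch = 0 joins above, this keeps every row of d1; rows without a match in d2 would simply get NA in z.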

answered Nov 19 '22 09:11 by Frank


Perhaps use foverlaps from data.table:

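# duplicate val so that each point becomes a zero-width interval [val, val1]
d1[, val1 := val]

# foverlaps needs y (d2) to be keyed, with the interval columns last in the key
setkey(d1, grp, val, val1)
setkey(d2, grp, from, to)

# overlap join, then drop the interval bounds and the helper column
d_merge <- foverlaps(d1, d2, nomatch = NA)
d_merge[, c("from", "to", "val1") := NULL]
d_merge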
#    grp z val y1 y2
#1:   a 11   2  1  5
#2:   a 14   7  4  8
#3:   b 15   6  3  7
#4:   c 13   3  2  6
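The code above adds val1 to d1 and re-keys both tables by reference. A variant that leaves the inputs alone might look like the sketch below. It assumes unmatched rows should be dropped (nomatch = 0L, as in the question's joins) and that d1's column order is wanted; `x`, `y` and `res` are illustrative names.

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# copies, so that d1 and d2 keep their columns and row order
x <- copy(d1)[, val1 := val]   # zero-width interval [val, val1]
y <- copy(d2)
setkey(y, grp, from, to)       # only y has to be keyed when by.x is given

res <- foverlaps(x, y, by.x = c("grp", "val", "val1"), nomatch = 0L)

# drop the interval bounds and the helper column, then restore d1's column order
res[, c("from", "to", "val1") := NULL]
setcolorder(res, c(names(d1), "z"))

names(res)  # "grp" "val" "y1" "y2" "z"
```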
answered Nov 19 '22 09:11 by Karthik Arumugham