Use `j` to select the join column of `x` and all its non-join columns

Question

I have two data tables:

library(data.table)
d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)

d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

I perform a non-equi join where the value 'val' in 'd1' should fall within the range defined by 'from' and 'to' in 'd2' for each group 'grp'.

d1[d2, on = .(grp, val >= from, val <= to), nomatch = 0]
#    grp val y1 y2 val.1  z
# 1:   a   1  1  5     4 11
# 2:   c   1  2  6     4 13
# 3:   a   5  4  8    10 14
# 4:   b   5  3  7    10 15

In the output, the join columns take their values from i: 'val' and 'val.1' hold the values of 'from' and 'to' in 'd2', respectively. However, I would like to have x's join column instead. Now, because...

"Columns of x can now be referred to using the prefix x. and is particularly useful during joining to refer to x's join columns as they are otherwise masked by i's."

...this could be achieved by specifying val = x.val in j:

d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]

In order to avoid typing all non-join columns (possibly many) from x in j, my current work-around is to join the above with the original data, which gives the desired result:

d1[d1[d2, .(grp, val = x.val, z), on = .(grp, val >= from, val <= to), nomatch = 0]
   , on = .(grp, val)]
#    grp val y1 y2  z
# 1:   a   2  1  5 11
# 2:   c   3  2  6 13
# 3:   a   7  4  8 14
# 4:   b   6  3  7 15

However, this seems a bit clumsy. Thus my question: how can I select the join column from x and all non-join columns from x in j in one go?

PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns).

Frank · Accepted Answer

"PS I have considered switching the x and i data sets, and the conditions in on. Although that produces the desired join values, it still requires post-processing (deleting, renaming and reordering of columns)."

The amount of post-processing is bounded by the number of on= columns:

d2[d1, on=.(grp, from <= val, to >= val), nomatch=0][, 
  `:=`(val = from, from = NULL, to = NULL)][]

That doesn't seem too bad.
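
If the column order should also match d1, the same idea can be sketched end to end as follows. This assumes the d1 and d2 from the question; `res` is a name introduced here for illustration, and the `setcolorder()` call is an extra step added for the reordering, not something the join does on its own.

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# d2 is x here, so the join columns `from` and `to` both carry d1's val
res <- d2[d1, on = .(grp, from <= val, to >= val), nomatch = 0]

# keep one copy of val, then drop the two join columns
res[, val := from]
res[, c("from", "to") := NULL]

# d1's columns first, in d1's order, followed by whatever came from d2
setcolorder(res, c(names(d1), setdiff(names(res), names(d1))))
res
```

Rows come back in the order of d1 rather than d2, which may or may not matter for a given use case.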

Following @Jaap's comment, here's another way, adding columns to d1 with an update join:

nm2 = setdiff(names(d2), c("from","to","grp"))
d1[d2, on=.(grp, val >= from, val <= to), (nm2) := mget(sprintf("i.%s", nm2))]

This makes sense here because the desired output is essentially d1 plus some columns from d2 (since each row of d1 matches at most one row of d2).
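
Since an update join can hold only one value per row of d1, it may be worth checking the at-most-one-match assumption before relying on it. A minimal sketch, assuming the question's data; `cnt` and `out` are names introduced here, the single column `z` is written out instead of using `mget()`, and the join runs on a copy so that d1 itself stays untouched:

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# number of d2 rows matched by each row of d1
cnt <- d2[d1, on = .(grp, from <= val, to >= val), .N, by = .EACHI]
stopifnot(all(cnt$N <= 1L))

# update join on a copy: add z from d2 to the matching rows
out <- copy(d1)
out[d2, on = .(grp, val >= from, val <= to), z := i.z]
out
```

Here `out` keeps d1's row order and columns, with `z` appended as the last column.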

Karthik Arumugham · Answer

Perhaps use foverlaps() from data.table:

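# foverlaps() needs an interval (start, end) in both tables, so duplicate
# val to turn each point into the zero-width interval [val, val1]
d1[, val1 := val]

# key both tables; the last two key columns are the interval bounds
setkey(d1, grp, val, val1)
setkey(d2, grp, from, to)

# overlap join, then drop the range columns and the helper column
d_merge <- foverlaps(d1, d2, nomatch = NA)
d_merge[, c("from", "to", "val1") := NULL]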
d_merge
#    grp  z val y1 y2
# 1:   a 11   2  1  5
# 2:   a 14   7  4  8
# 3:   b 15   6  3  7
# 4:   c 13   3  2  6
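
The steps above add val1 to d1 and re-sort it by key as a side effect. A sketch that leaves the inputs alone, assuming the question's d1 and d2: work on copies and pass the interval columns of x through `by.x` instead of keying it. `x`, `y` and `res` are names introduced here for illustration.

```r
library(data.table)

d1 <- data.table(grp = c("a", "c", "b", "a"), val = c(2, 3, 6, 7), y1 = 1:4, y2 = 5:8)
d2 <- data.table(grp = rep(c("a", "b", "c"), 2),
                 from = rep(c(1, 5), each = 3), to = rep(c(4, 10), each = 3), z = 11:16)

# copy of d1 with each point turned into the zero-width interval [val, val1]
x <- copy(d1)[, val1 := val]

# copy of d2, keyed because foverlaps() requires a key on y
y <- setkey(copy(d2), grp, from, to)

# overlap join; by.x names the group column and the interval columns of x
res <- foverlaps(x, y, by.x = c("grp", "val", "val1"), nomatch = NA)
res[, c("from", "to", "val1") := NULL]
res
```

With `nomatch = NA`, a row of d1 that falls in no range would be kept with `NA` in `z`; in this data every row has a match.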

Tags: r, data.table

Asked by Henrik
