I'm sure I'm overlooking the obvious, but I can't find a way to join all the columns of the "lookup" table on a data.table
non-equi join in one single step.
I looked at Arun's presentation (https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf) and multiple SO questions, but nearly all of them only deal with updating a single column, not joining multiple.
Suppose I have 2 data.tables a
and b
:
library(data.table)
a <- data.table(Company_ID = c(1,1,1,1),
salary = c(2000, 3000, 4000, 5000))
# Company_ID salary
# 1: 1 2000
# 2: 1 3000
# 3: 1 4000
# 4: 1 5000
b <- data.table(cat = c(1,2),
LB = c(0, 3000),
UB = c(3000,5000),
rep = c("Bob","Alice"))
# cat LB UB rep
# 1: 1 0 3000 Bob
# 2: 2 3000 5000 Alice
What I want in the end is matching the cat, LB, UB, rep (all cols in b
) to table a
:
# Company_ID salary cat LB UB rep
# 1: 1 2000 1 0 3000 Bob
# 2: 1 3000 2 3000 5000 Alice
# 3: 1 4000 2 3000 5000 Alice
Currently, I the only way I manage to do it is with the following two lines:
a <- a[b, on = .(salary >= LB, salary < UB), cat := cat]
a[b, on = .(cat == cat)]
Which outputs the desired table, but seems cumbersome and not at all like a data.table
approach. Any help would be greatly appreciated!
Non-equi joins are joins whose join conditions use conditional operators other than equals. An example would be where we are matching first name and then last name, but we are checking where one field from a table does not equal field from another table.
NON EQUI JOIN performs a JOIN using comparison operator other than equal(=) sign like >, <, >=, <= with conditions.
Equi Join Using Three Tables. We know that equijoin can also perform a join operation on more than two tables.
The EQUI JOIN in SQL performs a JOIN against a column of equality or the matching column(s) values that have the associated tables. Here, we use an equal sign (=) as a comparison operator in our 'where' clause to refer to equality.
1. EQUI JOIN : EQUI JOIN creates a JOIN for equality or matching column (s) values of the relative tables. EQUI JOIN also create JOIN by using JOIN with ON and then providing the names of the columns with their relative tables to check equality using equal sign (=). SELECT column_list FROM table1, table2....
SELECT column_list FROM table1 JOIN table2 [ON (join_condition)] SELECT student.name, student.id, record.class, record.city FROM student JOIN record ON student.city = record.city; 2. NON EQUI JOIN : NON EQUI JOIN performs a JOIN using comparison operator other than equal (=) sign like >, <, >=, <= with conditions.
The data.table package provides the foverlaps () function to join tables by overlapping ranges. (The ‘f’ in the function name stands for fast. The same is true for the fread (), fwrite (), and pretty much all of the other functions in the package that start with f.
How do you usually join two tables in SQL? Most likely, you select the common field in these two tables and join them using the equal sign in the join condition. For example, you can match the product ID from the product table with the product ID from the order table or the last name from the employee table with the last name from the timesheet.
Since you want results for every row of a
, you should do a join like b[a, ...]
:
b[a, on=.(LB <= salary, UB > salary), nomatch=0,
.(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]
Company_ID salary cat LB UB rep
1: 1 2000 1 0 3000 Bob
2: 1 3000 2 3000 5000 Alice
3: 1 4000 2 3000 5000 Alice
nomatch=0
means we'll drop rows of a
that are unmatched in b
. UB
and LB
columns from b
using the x.*
prefix (coming from the ?data.table
docs, where the arguments are named like x[i]
).Regarding the strange default cols, there is an open issue to change that behavior: #1615.
(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)
Alternately... One way that should work and avoids explicitly listing all columns: add a
's columns to b
, then subset b
:
b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))]
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]
There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget
(#1989). The temporary workaround is to enumerate a
's columns:
b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)]
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]
Second, it's inefficient to do this join twice (once for :=
and a second time for which
), but I can't see any way around that... maybe justifying a feature request to allow both j
and which
?
Now, that #1989 has been fixed with data.table version 1.12.3 (in development) it is possible to pick all columns from a
and b
without stating each column name explicitely:
a[b, on = .(salary >= LB, salary < UB),
mget(c(paste0("x.", names(a)), paste0("i.", names(b))))]
x.Company_ID x.salary i.cat i.LB i.UB i.rep 1: 1 2000 1 0 3000 Bob 2: 1 3000 2 3000 5000 Alice 3: 1 4000 2 3000 5000 Alice
which returns OP's expected result except for the column headers.
To change the column headers, setnames()
from the data.table
package can be used:
result <- a[b, on = .(salary >= LB, salary < UB),
mget(c(paste0("x.", names(a)), paste0("i.", names(b))))]
setnames(result, c(names(a), names(b)))
result
Company_ID salary cat LB UB rep 1: 1 2000 1 0 3000 Bob 2: 1 3000 2 3000 5000 Alice 3: 1 4000 2 3000 5000 Alice
or, with piping and using set_names()
from the magrittr
package
library(magrittr)
a[b, on = .(salary >= LB, salary < UB),
mget(c(paste0("x.", names(a)), paste0("i.", names(b)))) %>%
set_names(c(names(a), names(b)))]
Company_ID salary cat LB UB rep 1: 1 2000 1 0 3000 Bob 2: 1 3000 2 3000 5000 Alice 3: 1 4000 2 3000 5000 Alice
Admittedly, this is still cumbersome.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With