
non-equi joins adding all columns of range table in data.table in one step

Tags: join, r, data.table

I'm sure I'm overlooking the obvious, but I can't find a way to join all the columns of the "lookup" table on a data.table non-equi join in one single step.

I looked at Arun's presentation (https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf) and multiple SO questions, but nearly all of them only deal with updating a single column, not joining multiple.

Suppose I have 2 data.tables a and b:

library(data.table)
a <- data.table(Company_ID = c(1,1,1,1),
            salary = c(2000, 3000, 4000, 5000))

#   Company_ID salary
# 1:          1   2000
# 2:          1   3000
# 3:          1   4000
# 4:          1   5000

b <- data.table(cat = c(1,2),
            LB = c(0, 3000),
            UB = c(3000,5000),
            rep = c("Bob","Alice"))

#    cat   LB   UB   rep
# 1:   1    0 3000   Bob
# 2:   2 3000 5000 Alice

What I want in the end is to match cat, LB, UB and rep (all columns of b) onto table a, keeping only the rows of a whose salary falls inside one of the ranges:

#    Company_ID salary cat   LB   UB   rep
# 1:          1   2000   1    0 3000   Bob
# 2:          1   3000   2 3000 5000 Alice
# 3:          1   4000   2 3000 5000 Alice

Currently, the only way I have managed to do it is with the following two lines:

a <- a[b, on = .(salary >= LB, salary < UB), cat := cat]
a[b, on = .(cat == cat)]

This outputs the desired table, but it seems cumbersome and not at all like a data.table approach. Any help would be greatly appreciated!

bendae asked Jan 13 '17 15:01




2 Answers

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on=.(LB <= salary, UB > salary), nomatch=0, 
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice
  • nomatch=0 means we'll drop rows of a that are unmatched in b.
  • We need to explicitly refer to the UB and LB columns from b using the x.* prefix (coming from the ?data.table docs, where the arguments are named like x[i]).
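
To see why the x.* prefix matters, here is a small sketch that runs the same join once without j and once with the prefixed columns. It reuses a and b from the question; the names default_cols and bounds are only illustrative.

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Same join as above, but without j, so the default column set is returned.
default_cols <- b[a, on = .(LB <= salary, UB > salary), nomatch = 0]

# LB and UB no longer show b's bounds: both hold the matched salary from a.
print(default_cols)

# Asking for x.LB and x.UB in j recovers the bounds stored in b.
bounds <- b[a, on = .(LB <= salary, UB > salary), nomatch = 0,
            .(salary, LB = x.LB, UB = x.UB)]
print(bounds)
```

In the first result the LB and UB columns carry a's salary values, which is the default behaviour the answer calls strange.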

Regarding the strange default cols, there is an open issue to change that behavior: #1615.


(Issue #1989, referenced below, is fixed now -- See Uwe's answer.)

Alternately... One way that should work and avoids explicitly listing all columns: add b's columns to a, then subset a to the matched rows. (Updating in the other direction, b[a, ... := ...], would not do here: two salaries fall into Alice's range, and an update join keeps only the last match for each row of b.)

a[b, on=.(salary >= LB, salary < UB), names(b) := mget(paste0("i.", names(b)))] 
a[a[b, on=.(salary >= LB, salary < UB), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate b's columns:

a[b, on=.(salary >= LB, salary < UB), `:=`(cat = i.cat, LB = i.LB, UB = i.UB, rep = i.rep)] 
a[a[b, on=.(salary >= LB, salary < UB), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?
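
One possible way to skip the second join, sketched here under the assumption that b$cat never contains NA and that a data.table version with #1989 fixed is installed: update a copy of a from b, then drop the rows that stayed NA. The names a2, b_cols and matched are only illustrative.

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Work on a copy so the original a is not modified by reference.
a2 <- copy(a)
b_cols <- names(b)

# Update join: every row of a2 whose salary falls in a range of b
# receives all of b's columns; unmatched rows keep NA in those columns.
a2[b, on = .(salary >= LB, salary < UB),
   (b_cols) := mget(paste0("i.", b_cols))]

# Assumption: cat is never NA in b, so NA here means "no matching range".
matched <- a2[!is.na(cat)]
print(matched)
```

This relies on each salary matching at most one range; with overlapping ranges an update join would keep only one of the matches.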

Frank answered Oct 03 '22 17:10


Now that #1989 has been fixed in data.table version 1.12.3 (in development), it is possible to pick all columns from a and b without stating each column name explicitly:

a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b))))]
   x.Company_ID x.salary i.cat i.LB i.UB i.rep
1:            1     2000     1    0 3000   Bob
2:            1     3000     2 3000 5000 Alice
3:            1     4000     2 3000 5000 Alice

which returns OP's expected result except for the column headers.

To change the column headers, setnames() from the data.table package can be used:

result <- a[b, on = .(salary >= LB, salary < UB), 
            mget(c(paste0("x.", names(a)), paste0("i.", names(b))))] 
setnames(result, c(names(a), names(b)))
result
   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

Or, with piping and set_names() from the magrittr package:

library(magrittr)
a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b)))) %>% 
    set_names(c(names(a), names(b)))]
   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

Admittedly, this is still cumbersome.
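
If loading magrittr only for the renaming is unwanted, the same idea can be sketched with setNames() from base R's stats package wrapped around mget() inside j. This assumes a data.table version in which #1989 is fixed; out_names is an illustrative helper name.

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Column names for the result: a's columns followed by b's columns.
out_names <- c(names(a), names(b))

# mget() fetches the x.- and i.-prefixed columns as a list;
# setNames() renames that list before data.table builds the result.
result <- a[b, on = .(salary >= LB, salary < UB),
            setNames(mget(c(paste0("x.", names(a)), paste0("i.", names(b)))),
                     out_names)]
print(result)
```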

Uwe answered Oct 03 '22 16:10