non-equi joins adding all columns of range table in data.table in one step

Tags:

join

r

data.table

I'm sure I'm overlooking the obvious, but I can't find a way to join all the columns of the "lookup" table on a data.table non-equi join in one single step.

I looked at Arun's presentation (https://github.com/Rdatatable/data.table/wiki/talks/ArunSrinivasanSatRdaysBudapest2016.pdf) and multiple SO questions, but nearly all of them only deal with updating a single column, not joining multiple.

Suppose I have 2 data.tables a and b:

library(data.table)
a <- data.table(Company_ID = c(1,1,1,1),
            salary = c(2000, 3000, 4000, 5000))

#   Company_ID salary
# 1:          1   2000
# 2:          1   3000
# 3:          1   4000
# 4:          1   5000

b <- data.table(cat = c(1,2),
            LB = c(0, 3000),
            UB = c(3000,5000),
            rep = c("Bob","Alice"))

#    cat   LB   UB   rep
# 1:   1    0 3000   Bob
# 2:   2 3000 5000 Alice

What I want in the end is to match cat, LB, UB and rep (all columns of b) to table a:

#    Company_ID salary cat   LB   UB   rep
# 1:          1   2000   1    0 3000   Bob
# 2:          1   3000   2 3000 5000 Alice
# 3:          1   4000   2 3000 5000 Alice

Currently, the only way I can manage it is with the following two lines:

a <- a[b, on = .(salary >= LB, salary < UB), cat := cat]
a[b, on = .(cat == cat)]

This outputs the desired table, but it seems cumbersome and not at all like a data.table approach. Any help would be greatly appreciated!

asked Jan 13 '17 15:01

bendae

2 Answers

Since you want results for every row of a, you should do a join like b[a, ...]:

b[a, on=.(LB <= salary, UB > salary), nomatch=0, 
  .(Company_ID, salary, cat, LB = x.LB, UB = x.UB, rep)]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

nomatch=0 means we'll drop rows of a that are unmatched in b.
We need to explicitly refer to the UB and LB columns from b using the x.* prefix (coming from the ?data.table docs, where the arguments are named like x[i]).

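Regarding the strange default columns (without an explicit j, the LB and UB columns of the result hold the matched salary values from a rather than b's own bounds), there is an open issue to change that behavior: #1615.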
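To make the default-column behavior concrete, here is a minimal sketch using the tables from the question. It first shows what the join returns without a j expression, then one possible workaround: joining on throwaway copies of the bounds so that the real LB and UB survive. The names lb_join and ub_join are hypothetical helper columns introduced only for this illustration, not something from the question.

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Default output: LB and UB are both filled with a's salary values,
# so b's own bounds are no longer visible in the result.
default_res <- b[a, on = .(LB <= salary, UB > salary), nomatch = 0]

# Workaround sketch: join on copies of the bounds (hypothetical names
# lb_join / ub_join), leaving the real LB and UB columns untouched.
b2 <- copy(b)[, `:=`(lb_join = LB, ub_join = UB)]
res <- b2[a, on = .(lb_join <= salary, ub_join > salary), nomatch = 0]

# After the join, lb_join and ub_join both carry salary:
# keep one under its proper name, drop the other, restore the column order.
setnames(res, "lb_join", "salary")
res[, ub_join := NULL]
setcolorder(res, c(names(a), names(b)))
```

This needs only one join and no column enumeration in j, at the cost of copying b and two extra columns.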

(Issue #1989, referenced below, is fixed now; see Uwe's answer.)

Alternately... One way that should work and avoids explicitly listing all columns: add a's columns to b, then subset b:

b[a, on=.(LB <= salary, UB > salary), names(a) := mget(paste0("i.", names(a)))] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

There are two problems with this. First, there's a bug causing non-equi join to break when confronted with mget (#1989). The temporary workaround is to enumerate a's columns:

b[a, on=.(LB <= salary, UB > salary), `:=`(Company_ID = i.Company_ID, salary = i.salary)] 
b[b[a, on=.(LB <= salary, UB > salary), which=TRUE, nomatch=0]]

Second, it's inefficient to do this join twice (once for := and a second time for which), but I can't see any way around that... maybe justifying a feature request to allow both j and which?
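One possible way to avoid the second join, sketched under the assumption that the ranges in b do not overlap (so each row of a matches at most one row of b): run the join once with which=TRUE, then use the returned row numbers to bind the matching rows of b onto a. This also sidesteps a caveat of the update-join variant: as far as I can tell, := keeps only the last matching row of a for each row of b, so salaries that share a range can be overwritten.

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Single join: for each row of a, the row number of b whose range
# contains salary, or NA when no range matches.
idx <- b[a, on = .(LB <= salary, UB > salary), which = TRUE]

# Assumption: non-overlapping ranges, hence exactly one index per row of a.
stopifnot(length(idx) == nrow(a))

# Bind a to the matched rows of b, then drop rows of a without a match.
res <- cbind(a, b[idx])[!is.na(idx)]
```

No column of a or b has to be named explicitly, and the x.* prefix is not needed because b's columns are taken by row number rather than from the join result.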

answered Oct 03 '22 17:10

Frank

Now that #1989 has been fixed in data.table version 1.12.3 (in development), it is possible to pick all columns from a and b without stating each column name explicitly:

a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b))))]

   x.Company_ID x.salary i.cat i.LB i.UB i.rep
1:            1     2000     1    0 3000   Bob
2:            1     3000     2 3000 5000 Alice
3:            1     4000     2 3000 5000 Alice

which returns OP's expected result except for the column headers.

To change the column headers, setnames() from the data.table package can be used:

result <- a[b, on = .(salary >= LB, salary < UB), 
            mget(c(paste0("x.", names(a)), paste0("i.", names(b))))] 
setnames(result, c(names(a), names(b)))
result

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

or, with piping and using set_names() from the magrittr package:

library(magrittr)
a[b, on = .(salary >= LB, salary < UB), 
  mget(c(paste0("x.", names(a)), paste0("i.", names(b)))) %>% 
    set_names(c(names(a), names(b)))]

   Company_ID salary cat   LB   UB   rep
1:          1   2000   1    0 3000   Bob
2:          1   3000   2 3000 5000 Alice
3:          1   4000   2 3000 5000 Alice

Admittedly, this is still cumbersome.
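As a small variation on the setnames() step, the prefixed names can be built once and then mapped back by name rather than by position. This is only a sketch: prefixed and plain are hypothetical helper variables, and it assumes a and b share no column names (otherwise the renamed result would contain duplicates).

```r
library(data.table)

a <- data.table(Company_ID = c(1, 1, 1, 1),
                salary = c(2000, 3000, 4000, 5000))
b <- data.table(cat = c(1, 2),
                LB = c(0, 3000),
                UB = c(3000, 5000),
                rep = c("Bob", "Alice"))

# Prefixed names used inside j, and the plain names wanted in the end.
prefixed <- c(paste0("x.", names(a)), paste0("i.", names(b)))
plain    <- c(names(a), names(b))

# Same join as above; requires a data.table version that includes the #1989 fix.
result <- a[b, on = .(salary >= LB, salary < UB), mget(prefixed)]

# Rename old -> new by name, so the mapping does not depend on column order.
setnames(result, old = prefixed, new = plain)
```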

answered Oct 03 '22 16:10

Uwe
