Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Odd behavior when joining with multiple conditions

Tags:

join

r

data.table

In answering this question about rolling joins with the data.table package, I've run into some odd behavior when using multiple conditions.

Considering the following datasets:

dt <- data.table(t_id = c(1,4,2,3,5), place = c("a","a","d","a","d"), num = c(5.1, 5.1, 6.2, 5.1, 6.2), key=c("place"))
dt_lu <- data.table(f_id = c(rep(1,4),rep(2,3)), place = c("a","b","c","d","a","d","a"), num = c(6,7,8,9,6,7,8), key=c("place"))

When I want to join dt with dt_lu where only those cases of dt_lu that have the same place and where dt_lu$num is higher than dt$num as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI]

I get the desired result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    8   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    8   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    8   2
10:     d   2  6.2    9   1
11:     d   2  6.2    7   2
12:     d   5  6.2    9   1
13:     d   5  6.2    7   2

When I want to add an additional condition, I can get the desired result easily by chaining that additional conditions as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num],
               fid = f_id),
      by = .EACHI][fnum - tnum < 2]

which gives me:

   place tid tnum fnum fid
1:     a   1  5.1    6   1
2:     a   1  5.1    6   2
3:     a   4  5.1    6   1
4:     a   4  5.1    6   2
5:     a   3  5.1    6   1
6:     a   3  5.1    6   2
7:     d   2  6.2    7   2
8:     d   5  6.2    7   2

However when I add the extra condition (i.e.: the difference has to be less than 2) as follows:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[i.num < num & num - i.num < 2],
               fid = f_id),
      by = .EACHI]

I do not get the expected result:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 3:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 6:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
 9:     a   3  5.1    6   2
10:     d   2  6.2    7   1
11:     d   2  6.2    7   2
12:     d   5  6.2    7   1
13:     d   5  6.2    7   2

Moreover, I get the following warning message:

Warning message: In [.data.table(dt_lu, dt, list(tid = i.t_id, tnum = i.num, fnum = num[i.num < : Column 3 of result for group 1 is length 2 but the longest column in this result is 3. Recycled leaving remainder of 1 items. This warning is once only for the first group with this issue.

The expected result would be:

    place tid tnum fnum fid
 1:     a   1  5.1    6   1
 2:     a   1  5.1    6   2
 4:     a   4  5.1    6   1
 5:     a   4  5.1    6   2
 7:     a   3  5.1    6   1
 8:     a   3  5.1    6   2
11:     d   2  6.2    7   2
13:     d   5  6.2    7   2

I deliberately kept the rownumbers from the first example to show which rows have to be maintained in the final result (which is the same as the working solution).

As this answer shows, it should be possible to use multiple conditions inside the join operation.

I tried the following alternatives, but both of them don't work:

dt_lu[dt, list(tid = i.t_id,
               tnum = i.num,
               fnum = num[(i.num < num) & (num - i.num < 2)],
               fid = f_id),
      by = .EACHI]

dt_lu[dt, {
  val = num[(i.num < num) & (num - i.num < 2)];
  list(tid = i.t_id,
       tnum = i.num,
       fnum = val,
       fid = f_id)},
  by = .EACHI]

Could someone explain me why I don't get the desired result with multiple conditions inside the join operation?

like image 685
Jaap Avatar asked Aug 16 '15 10:08

Jaap


People also ask

How many types of join conditions are there?

There are four main types of JOINs in SQL: INNER JOIN, OUTER JOIN, CROSS JOIN, and SELF JOIN.

What is compound join conditions?

A compound join joins two tables and are significant when columns of multiple tables have a unique key to join to another table, as in the case of a relational database. A compound join can be used in combination with a self join, inner join, or outer join.

What is conditions to perform join operation?

Usually, when you write a join query, you specify a condition that conveys a relationship between the tables specified in the FROM clause. This condition is referred to as the join condition. The join condition specifies how the rows from one table will be combined with the rows of another table.


1 Answers

The warning message gives away the issue. Also, using print() is quite helpful here.

dt_lu[dt, print(i.num < num & num - i.num < 2), by=.EACHI]
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1]  TRUE  TRUE FALSE
# [1] FALSE  TRUE
# [1] FALSE  TRUE
# Empty data.table (0 rows) of 3 cols: place,place,num

Consider the first case where the condition evaluates to TRUE, TRUE, FALSE. There are 3 observations for this group. And your j-expression contains:

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id)

i.t_id and i.num are of length 1 (as they come from dt). But num[..condn..] will return length = 2, whereas f_id will return length = 3. Both the length=1 and length=2 items will be recycled to the length of the longest item/vector = 3. That leads to incorrect result. Since 3 isn't perfectly divisible by 2, it returns the warning.

What you intend to do is:

.(tid = i.t_id,
  tnum = i.num,
  fnum = num[i.num < num & num - i.num < 2],
  fid = f_id[i.num < num & num - i.num < 2])

or equivalently:

{  
  idx = i.num < num & num - i.num < 2
  .(tid  = i.t_id, tnum = i.num, fnum = num[idx], fid  = f_id[idx])
}

Putting it together:

dt_lu[dt, 
       {
         idx = i.num < num & num - i.num < 2
        .(tid  = i.t_id, tnum = i.num, fnum = num[idx], fid  = f_id[idx])
       }, 
by = .EACHI]
#    place tid tnum fnum fid
# 1:     a   1  5.1    6   1
# 2:     a   1  5.1    6   2
# 3:     a   4  5.1    6   1
# 4:     a   4  5.1    6   2
# 5:     a   3  5.1    6   1
# 6:     a   3  5.1    6   2
# 7:     d   2  6.2    7   2
# 8:     d   5  6.2    7   2
like image 161
Arun Avatar answered Oct 12 '22 21:10

Arun