In answering this question about rolling joins with the data.table
package, I've run into some odd behavior when using multiple conditions.
Considering the following datasets:
dt <- data.table(t_id = c(1,4,2,3,5), place = c("a","a","d","a","d"), num = c(5.1, 5.1, 6.2, 5.1, 6.2), key=c("place"))
dt_lu <- data.table(f_id = c(rep(1,4),rep(2,3)), place = c("a","b","c","d","a","d","a"), num = c(6,7,8,9,6,7,8), key=c("place"))
When I want to join dt
with dt_lu
where only those cases of dt_lu
that have the same place
and where dt_lu$num
is higher than dt$num
as follows:
dt_lu[dt, list(tid = i.t_id,
tnum = i.num,
fnum = num[i.num < num],
fid = f_id),
by = .EACHI]
I get the desired result:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 9 1
11: d 2 6.2 7 2
12: d 5 6.2 9 1
13: d 5 6.2 7 2
When I want to add an additional condition, I can get the desired result easily by chaining that additional conditions as follows:
dt_lu[dt, list(tid = i.t_id,
tnum = i.num,
fnum = num[i.num < num],
fid = f_id),
by = .EACHI][fnum - tnum < 2]
which gives me:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2
However when I add the extra condition (i.e.: the difference has to be less than 2
) as follows:
dt_lu[dt, list(tid = i.t_id,
tnum = i.num,
fnum = num[i.num < num & num - i.num < 2],
fid = f_id),
by = .EACHI]
I do not get the expected result:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 6 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 6 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 6 2
10: d 2 6.2 7 1
11: d 2 6.2 7 2
12: d 5 6.2 7 1
13: d 5 6.2 7 2
Moreover, I get the following warning message:
Warning message: In
[.data.table
(dt_lu, dt, list(tid = i.t_id, tnum = i.num, fnum = num[i.num < : Column 3 of result for group 1 is length 2 but the longest column in this result is 3. Recycled leaving remainder of 1 items. This warning is once only for the first group with this issue.
The expected result would be:
place tid tnum fnum fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
11: d 2 6.2 7 2
13: d 5 6.2 7 2
I deliberately kept the rownumbers from the first example to show which rows have to be maintained in the final result (which is the same as the working solution).
As this answer shows, it should be possible to use multiple conditions inside the join operation.
I tried the following alternatives, but both of them don't work:
dt_lu[dt, list(tid = i.t_id,
tnum = i.num,
fnum = num[(i.num < num) & (num - i.num < 2)],
fid = f_id),
by = .EACHI]
dt_lu[dt, {
val = num[(i.num < num) & (num - i.num < 2)];
list(tid = i.t_id,
tnum = i.num,
fnum = val,
fid = f_id)},
by = .EACHI]
Could someone explain me why I don't get the desired result with multiple conditions inside the join operation?
There are four main types of JOINs in SQL: INNER JOIN, OUTER JOIN, CROSS JOIN, and SELF JOIN.
A compound join joins two tables and are significant when columns of multiple tables have a unique key to join to another table, as in the case of a relational database. A compound join can be used in combination with a self join, inner join, or outer join.
Usually, when you write a join query, you specify a condition that conveys a relationship between the tables specified in the FROM clause. This condition is referred to as the join condition. The join condition specifies how the rows from one table will be combined with the rows of another table.
The warning message gives away the issue. Also, using print()
is quite helpful here.
dt_lu[dt, print(i.num < num & num - i.num < 2), by=.EACHI]
# [1] TRUE TRUE FALSE
# [1] TRUE TRUE FALSE
# [1] TRUE TRUE FALSE
# [1] FALSE TRUE
# [1] FALSE TRUE
# Empty data.table (0 rows) of 3 cols: place,place,num
Consider the first case where the condition evaluates to TRUE, TRUE, FALSE
. There are 3 observations for this group. And your j-expression
contains:
.(tid = i.t_id,
tnum = i.num,
fnum = num[i.num < num & num - i.num < 2],
fid = f_id)
i.t_id
and i.num
are of length 1 (as they come from dt
). But num[..condn..]
will return length = 2, whereas f_id
will return length = 3. Both the length=1 and length=2 items will be recycled to the length of the longest item/vector = 3. That leads to incorrect result. Since 3 isn't perfectly divisible by 2, it returns the warning.
What you intend to do is:
.(tid = i.t_id,
tnum = i.num,
fnum = num[i.num < num & num - i.num < 2],
fid = f_id[i.num < num & num - i.num < 2])
or equivalently:
{
idx = i.num < num & num - i.num < 2
.(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
}
Putting it together:
dt_lu[dt,
{
idx = i.num < num & num - i.num < 2
.(tid = i.t_id, tnum = i.num, fnum = num[idx], fid = f_id[idx])
},
by = .EACHI]
# place tid tnum fnum fid
# 1: a 1 5.1 6 1
# 2: a 1 5.1 6 2
# 3: a 4 5.1 6 1
# 4: a 4 5.1 6 2
# 5: a 3 5.1 6 1
# 6: a 3 5.1 6 2
# 7: d 2 6.2 7 2
# 8: d 5 6.2 7 2
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With