Unintuitive (?) table joining behaviour with 2 data.tables

Question

When joining two data.tables using a non-equi join, in my head I imagine the computer looking from top to bottom of the join column until it finds an element that satisfies the join criteria.

library(data.table)

dt1 <- data.table(date = as.Date("2015-01-01"))
dt2 <- data.table(id = 1:4, date = as.Date(c("2014-01-01", "2015-02-01", "2016-01-01", "2017-01-01")))

setorder(dt2, date)

dt1[dt2, on = .(date <= date), `:=`(date2 = i.date, id = i.id)]

In the above example, the post-join dt1 looks like this:

         date      date2 id
1: 2015-01-01 2017-01-01  4

What confuses me is why the date that is joined from dt2 is the last date (2017-01-01) that satisfies the join criteria rather than the first date that meets the criteria (i.e. 2015-02-01).

I note that when you reverse the order of dt2 by running setorder(dt2, -date) it seems to join the earliest date that meets the criteria.

I can think of two ways this works. One is silly: the computer looks at the table from bottom-to-top and picks the first one. The second is that the join algorithm is greedy, it matches (and subsequently replaces matches) until it no longer finds matches meeting the criteria.

My question is whether the latter behaviour is indeed how data.table works, and if so, is there any way to change this behaviour other than reversing the order of the table.

thelatemail · Accepted Answer

From reading the help file at ?`:=` , it looks like the second option is what is done. My rationale is based on this statement in the 'Details' section:

The '.Last.updated' variable contains the number of rows updated by the most recent ':=' or 'set' calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see '.Last.updated' for details.

Checking this variable after the code shown in the question shows:

dt1[dt2, on = .(date <= date), `:=`(date2 = i.date, id = i.id)]
.Last.updated
#[1] 3

So, 3 'rows' were updated, even though only 1 row is changed in the ultimate output.

As to how to control exactly which row is returned, a couple of options come to mind, both relying on the fact that dt2[dt1, ... is a right outer join which means every row in dt1 is returned by default. Which means if you nest dt2[dt1, ... inside dt1, you can directly assign via := like:

## if just the first or last is desired:
dt1[, c("date2","id") := dt2[dt1, on=.(date >= date), mult="last", .(x.date, id)]]

## if you want to pick the Nth match
dt1[, c("date2","id") := dt2[dt1, on=.(date >= date), 
                             data.table(x.date,id)[3], by=.EACHI][, -"date"]]

Unintuitive (?) table joining behaviour with 2 data.tables

Tags:

r

data.table

Aaron

1 Answers

thelatemail

Recent Activity

Donate For Us

Unintuitive (?) table joining behaviour with 2 data.tables

Tags:

r

data.table

Aaron

1 Answers

thelatemail

Related questions

Recent Activity

Donate For Us