Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Unintuitive (?) table joining behaviour with 2 data.tables

Tags:

r

data.table

When joining two data.tables using a non-equi join, in my head I imagine the computer looking from top to bottom of the join column until it finds an element that satisfies the join criteria.

library(data.table)

dt1 <- data.table(date = as.Date("2015-01-01"))
dt2 <- data.table(id = 1:4, date = as.Date(c("2014-01-01", "2015-02-01", "2016-01-01", "2017-01-01")))

setorder(dt2, date)

dt1[dt2, on = .(date <= date), `:=`(date2 = i.date, id = i.id)]

In the above example, the post-join dt1 looks like this:

         date      date2 id
1: 2015-01-01 2017-01-01  4

What confuses me is why the date that is joined from dt2 is the last date (2017-01-01) that satisfies the join criteria rather than the first date that meets the criteria (i.e. 2015-02-01).

I note that when you reverse the order of dt2 by running setorder(dt2, -date) it seems to join the earliest date that meets the criteria.

I can think of two ways this works. One is silly: the computer looks at the table from bottom-to-top and picks the first one. The second is that the join algorithm is greedy, it matches (and subsequently replaces matches) until it no longer finds matches meeting the criteria.

My question is whether the latter behaviour is indeed how data.table works, and if so, is there any way to change this behaviour other than reversing the order of the table.

like image 342
Aaron Avatar asked Mar 23 '26 01:03

Aaron


1 Answers

From reading the help file at ?`:=` , it looks like the second option is what is done. My rationale is based on this statement in the 'Details' section:

The '.Last.updated' variable contains the number of rows updated by the most recent ':=' or 'set' calls, which may be useful, for example, in production settings for testing assumptions about the number of rows affected by a statement; see '.Last.updated' for details.

Checking this variable after the code shown in the question shows:

dt1[dt2, on = .(date <= date), `:=`(date2 = i.date, id = i.id)]
.Last.updated
#[1] 3

So, 3 'rows' were updated, even though only 1 row is changed in the ultimate output.

As to how to control exactly which row is returned, a couple of options come to mind, both relying on the fact that dt2[dt1, ... is a right outer join which means every row in dt1 is returned by default. Which means if you nest dt2[dt1, ... inside dt1, you can directly assign via := like:

## if just the first or last is desired:
dt1[, c("date2","id") := dt2[dt1, on=.(date >= date), mult="last", .(x.date, id)]]

## if you want to pick the Nth match
dt1[, c("date2","id") := dt2[dt1, on=.(date >= date), 
                             data.table(x.date,id)[3], by=.EACHI][, -"date"]]
like image 125
thelatemail Avatar answered Mar 26 '26 13:03

thelatemail