Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

rolling joins in data.table with multiple matches

I have a remark/question about rolling joins
Let X,Y be:

set.seed(123);
X <- data.table(x=c(1,1,1,2,2),y=c(T,T,F,F,F),t=as.POSIXct("08:00:00.000",format="%H:%M:%OS")+sample(0:999,5,TRUE)/1e3)
Y <- copy(X)
set.seed(123)
Y[,`:=`(IDX=.I,t=t+sample(c(-5:5)/1e3,5,T))]
Y <- rbindlist(list(Y, X[5,][,IDX:=6][,t:=t+0.001], X[5,][,IDX:=7][,t:=t+0.002]))
setkey(X,x,y,t)
setkey(Y,x,y,t)

here X and Y are sorted by x,y,t

R) X
   x     y                       t
1: 1 FALSE 2013-06-20 08:00:00.407
2: 1  TRUE 2013-06-20 08:00:00.286
3: 1  TRUE 2013-06-20 08:00:00.788
4: 2 FALSE 2013-06-20 08:00:00.882
5: 2 FALSE 2013-06-20 08:00:00.940
R) Y
   x     y                       t IDX
1: 1 FALSE 2013-06-20 08:00:00.407   3
2: 1  TRUE 2013-06-20 08:00:00.284   1
3: 1  TRUE 2013-06-20 08:00:00.791   2
4: 2 FALSE 2013-06-20 08:00:00.886   4
5: 2 FALSE 2013-06-20 08:00:00.940   6
6: 2 FALSE 2013-06-20 08:00:00.942   7
7: 2 FALSE 2013-06-20 08:00:00.945   5

From

executing Y[X, roll=-0.005] gets you

R) Y[X, roll=-0.005]
       x     y                       t IDX
    1: 1 FALSE 2013-06-20 08:00:00.407  NA => due to precision the roll is no match
    2: 1  TRUE 2013-06-20 08:00:00.286  NA => ok 
    3: 1  TRUE 2013-06-20 08:00:00.788   2 => ok (x,y) matched and 788-791=-3
    4: 2 FALSE 2013-06-20 08:00:00.882   4 => same
    5: 2 FALSE 2013-06-20 08:00:00.940   6 => NOT AN EXACT MATCH (precision)

So I would have expected to get more lines FOR THE LAST LINE as the default behaviour of "mult" is "all" and the last line of X is matched by lines 5,6, may be 7 of Y

like image 330
statquant Avatar asked Oct 21 '22 07:10

statquant


1 Answers

First Row

For the first row of X only the first row of Y has matching x and y so the first row of Y will match iff Y$t[1] is between X$t[1] and X$t[1] + 0.005 but in fact Y$t[1] < X$t[1] as seen here:

> X$t[1] - Y$t[1]
Time difference of 0.0009999275 secs

One would need a positive roll= value whose magnitude is at least equal to the above difference in order to get a non-NA in row 1. e.g.

> Y[X, roll=0.001]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00   3
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00  NA
4: 2 FALSE 2013-06-20 08:00:00  NA
5: 2 FALSE 2013-06-20 08:00:00  NA

Note that you could force it by using rollends= like this:

> Y[X, roll = -0.005, rollends = TRUE]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00   3
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

Last Row

For the last row of X only row 5 of Y matches, not 5, 6 and 7, since only the nearest eligible row is a match. mult= only applies to multiple matches and usually does not apply with roll= (see example at the bottom for an exception):

Note also that rows 5, 6 and 7 of Y do not have the same times. They have increasing times so there is no possibility that they all match:

> dput(Y[["t"]])
structure(c(1371729600.407, 1371729600.285, 1371729600.791, 1371729600.887, 
1371729600.941, 1371729600.942, 1371729600.945), class = c("POSIXct", 
"POSIXt"))

Even if rows 5, 6 and 7 of Y have identical times if those times are not the same as the time in the last row of X then one only gets one row out.

> # times in rows 5, 6 and 7 of Y2 are same
> Y2 <- copy(Y)
> Y2[, t:= t[c(1:4, 5, 5, 5)]]
> setkey(Y2, x, y, t)
> Y2[X, roll = -0.005]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

Its only if rows 5, 6 and 7 of 'Y' have the same times and the last row of X also has that time that one gets multiple times out and in that case mult= can apply:

> # time in row 5 of X2 same as the times in rows 5, 6 and 7 of Y2
> X2 <- copy(X)
> X2[, t:=c(t[1:4], Y2[["t"]][5])]
> Y2[X2, roll = -0.005]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6
6: 2 FALSE 2013-06-20 08:00:00   7
7: 2 FALSE 2013-06-20 08:00:00   5
> 
> Y2[X, roll = -0.005, mult = "first"]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

How it works is not so clear from the documentation and I had to use trial and error to discover how it functions. ?data.table does say that "Usually, there should be no duplicates in x's key" (in our example here x is Y) so it may be that the developers wished to leave it undefined for this situation and open to future change.

The idea of using mult= as you describe does seem like a very interesting idea but it seems not to be the way it currently works. Maybe it could in the future.

like image 184
G. Grothendieck Avatar answered Nov 03 '22 20:11

G. Grothendieck