I have a remark/question about rolling joins
Let X,Y be:
set.seed(123);
X <- data.table(x=c(1,1,1,2,2),y=c(T,T,F,F,F),t=as.POSIXct("08:00:00.000",format="%H:%M:%OS")+sample(0:999,5,TRUE)/1e3)
Y <- copy(X)
set.seed(123)
Y[,`:=`(IDX=.I,t=t+sample(c(-5:5)/1e3,5,T))]
Y <- rbindlist(list(Y, X[5,][,IDX:=6][,t:=t+0.001], X[5,][,IDX:=7][,t:=t+0.002]))
setkey(X,x,y,t)
setkey(Y,x,y,t)
here X
and Y
are sorted by x,y,t
R) X
x y t
1: 1 FALSE 2013-06-20 08:00:00.407
2: 1 TRUE 2013-06-20 08:00:00.286
3: 1 TRUE 2013-06-20 08:00:00.788
4: 2 FALSE 2013-06-20 08:00:00.882
5: 2 FALSE 2013-06-20 08:00:00.940
R) Y
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00.407 3
2: 1 TRUE 2013-06-20 08:00:00.284 1
3: 1 TRUE 2013-06-20 08:00:00.791 2
4: 2 FALSE 2013-06-20 08:00:00.886 4
5: 2 FALSE 2013-06-20 08:00:00.940 6
6: 2 FALSE 2013-06-20 08:00:00.942 7
7: 2 FALSE 2013-06-20 08:00:00.945 5
From
executing Y[X, roll=-0.005]
gets you
R) Y[X, roll=-0.005]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00.407 NA => due to precision the roll is no match
2: 1 TRUE 2013-06-20 08:00:00.286 NA => ok
3: 1 TRUE 2013-06-20 08:00:00.788 2 => ok (x,y) matched and 788-791=-3
4: 2 FALSE 2013-06-20 08:00:00.882 4 => same
5: 2 FALSE 2013-06-20 08:00:00.940 6 => NOT AN EXACT MATCH (precision)
So I would have expected to get more lines FOR THE LAST LINE as the default behaviour of "mult" is "all" and the last line of X
is matched by lines 5,6, may be 7
of Y
First Row
For the first row of X
only the first row of Y
has matching x
and y
so the first row of Y
will match iff Y$t[1]
is between X$t[1]
and X$t[1] + 0.005
but in fact Y$t[1] < X$t[1]
as seen here:
> X$t[1] - Y$t[1]
Time difference of 0.0009999275 secs
One would need a positive roll=
value whose magnitude is at least equal to the above difference in order to get a non-NA in row 1. e.g.
> Y[X, roll=0.001]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 3
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 NA
4: 2 FALSE 2013-06-20 08:00:00 NA
5: 2 FALSE 2013-06-20 08:00:00 NA
Note that you could force it by using rollends=
like this:
> Y[X, roll = -0.005, rollends = TRUE]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 3
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
Last Row
For the last row of X
only row 5 of Y
matches, not 5, 6 and 7, since only the nearest eligible row is a match. mult=
only applies to multiple matches and usually does not apply with roll=
(see example at the bottom for an exception):
Note also that rows 5, 6 and 7 of Y
do not have the same times. They have increasing times so there is no possibility that they all match:
> dput(Y[["t"]])
structure(c(1371729600.407, 1371729600.285, 1371729600.791, 1371729600.887,
1371729600.941, 1371729600.942, 1371729600.945), class = c("POSIXct",
"POSIXt"))
Even if rows 5, 6 and 7 of Y
have identical times if those times are not the same as the time in the last row of X
then one only gets one row out.
> # times in rows 5, 6 and 7 of Y2 are same
> Y2 <- copy(Y)
> Y2[, t:= t[c(1:4, 5, 5, 5)]]
> setkey(Y2, x, y, t)
> Y2[X, roll = -0.005]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
Its only if rows 5, 6 and 7 of 'Y' have the same times and the last row of X
also has that time that one gets multiple times out and in that case mult=
can apply:
> # time in row 5 of X2 same as the times in rows 5, 6 and 7 of Y2
> X2 <- copy(X)
> X2[, t:=c(t[1:4], Y2[["t"]][5])]
> Y2[X2, roll = -0.005]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
6: 2 FALSE 2013-06-20 08:00:00 7
7: 2 FALSE 2013-06-20 08:00:00 5
>
> Y2[X, roll = -0.005, mult = "first"]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
How it works is not so clear from the documentation and I had to use trial and error to discover how it functions. ?data.table
does say that "Usually, there should be no duplicates in x's key" (in our example here x is Y
) so it may be that the developers wished to leave it undefined for this situation and open to future change.
The idea of using mult=
as you describe does seem like a very interesting idea but it seems not to be the way it currently works. Maybe it could in the future.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With