Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rolling join on data.table with duplicate keys

Tags:

join

r

data.table

I'm trying to understand rolling joins in data.table. The data to reproduce this is given at the end.

Given a data.table of transactions at an airport, at a given time:

> dt
   t_id airport thisTime
1:    1       a      5.1
2:    3       a      5.1
3:    2       a      6.2  

(note t_ids 1 & 3 have the same airport and time)

and a lookup table of flights departing from airports:

> dt_lookup
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    1       b        7
4:    1       c        8
5:    2       d        7
6:    1       d        9
7:    2       e        8

> tables()
     NAME      NROW NCOL MB COLS                  KEY             
[1,] dt           3    3  1 t_id,airport,thisTime airport,thisTime
[2,] dt_lookup    7    3  1 f_id,airport,thisTime airport,thisTime

I would like to match all the transactions to all the next possible flights departing from that airport, to give:

   t_id airport thisTime f_id
      1       a        6    1
      1       a        6    2
      3       a        6    1
      3       a        6    2

So I thought this would work:

> dt[dt_lookup, nomatch=0,roll=Inf]
   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2

But it hasn't returned transactions t_id == 1.

From the documentation it says:

Usually, there should be no duplicates in x’s key,...

However, I do have duplicates in my 'x key' (namely airport & thisTime), and can't quite see/understand what's going on to mean t_id = 1 gets removed from the output.

Can anyone shed some light as to why t_id = 1 is not returned, and how can I get the join to work for when I have duplicates?

Data

library(data.table)
dt <- data.table(t_id = seq(1:3),
                 airport = c("a","a","a"),
                 thisTime = c(5.1,6.2, 5.1), key=c( "airport","thisTime"))

dt_lookup <- data.table(f_id = c(rep(1,4),rep(2,3)),
                        airport = c("a","b","c","d",
                                 "a","d","e"),
                        thisTime = c(6,7,8,9,
                                 6,7,8), key=c("airport","thisTime"))
like image 531
tospig Avatar asked Aug 14 '15 09:08

tospig


1 Answers

The reason that t_id = 1 doesn't show up in the output is because a rolling join takes the row where the key-combination occurs last. From the documentation (emphasis mine):

Applies to the last join column, generally a date but can be any ordered variable, irregular and including gaps. If roll=TRUE and i's row matches to all but the last x join column, and its value in the last i join column falls in a gap (including after the last observation in x for that group), then the prevailing value in x is rolled forward. This operation is particularly fast using a modified binary search. The operation is also known as last observation carried forward (LOCF).

Let's consider somewhat larger datasets:

> DT
   t_id airport thisTime
1:    1       a      5.1
2:    4       a      5.1
3:    3       a      5.1
4:    2       d      6.2
5:    5       d      6.2

> DT_LU
   f_id airport thisTime
1:    1       a        6
2:    2       a        6
3:    2       a        8
4:    1       b        7
5:    1       c        8
6:    2       d        7
7:    1       d        9

When you perform a rolling join just like in your question:

DT[DT_LU, nomatch=0, roll=Inf]

you get:

   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    3       a        8    2
4:    5       d        7    2
5:    5       d        9    1

As you can see, from both the key combination a, 5.1 and d, 6.2 the last row is used for the joined datatable. Because you use Inf as roll-value, all the future values are incorporated in the resulting datatable. When you use:

DT[DT_LU, nomatch=0, roll=1]

you see that only the first value in the future is included:

   t_id airport thisTime f_id
1:    3       a        6    1
2:    3       a        6    2
3:    5       d        7    2

If you want the f_id's for for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) by means of the ceiling function. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:

DT[, thisTime2 := ceiling(thisTime)]
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]

which gives:

   t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    4       a      5.1         6    1
3:    3       a      5.1         6    1
4:    1       a      5.1         6    2
5:    4       a      5.1         6    2
6:    3       a      5.1         6    2
7:    2       d      6.2         7    2
8:    5       d      6.2         7    2

Applied to the data you provided:

> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]

   t_id airport thisTime thisTime2 f_id
1:    1       a      5.1         6    1
2:    3       a      5.1         6    1
3:    1       a      5.1         6    2
4:    3       a      5.1         6    2

When you want to include al the future values instead of only the first one, you need a somewhat different approach for which you will need the i.col functionality (which is not documented yet):

1: First set the key to only the airport columns:

setkey(DT, airport)
setkey(DT_LU, airport)

2: Use the i.col functionality (which is not documented yet) in j to get what you want as follows:

DT1 <- DT_LU[DT, .(tid = i.t_id,
                   tTime = i.thisTime,
                   fTime = thisTime[i.thisTime < thisTime],
                   fid = f_id[i.thisTime < thisTime]),
             by=.EACHI]

this gives you:

> DT1
    airport tid tTime fTime fid
 1:       a   1   5.1     6   1
 2:       a   1   5.1     6   2
 3:       a   1   5.1     8   2
 4:       a   4   5.1     6   1
 5:       a   4   5.1     6   2
 6:       a   4   5.1     8   2
 7:       a   3   5.1     6   1
 8:       a   3   5.1     6   2
 9:       a   3   5.1     8   2
10:       d   2   6.2     7   2
11:       d   2   6.2     9   1
12:       d   5   6.2     7   2
13:       d   5   6.2     9   1

Some explanation: In case when you are joining two datatables where the same columnnames are used, you can refer to the columns of the datatable in i by preceding the columnnames with i.. Now it's possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI you assure that all combinations for with the condition holds are included in the resulting datatable.

Alternatively, you can achieve the same with:

DT2 <- DT_LU[DT, .(airport=i.airport,
                   tid=i.t_id,
                   tTime=i.thisTime,
                   fTime=thisTime[i.thisTime < thisTime],
                   fid=f_id[i.thisTime < thisTime]),
             allow.cartesian=TRUE]

which gives the same result:

> identical(DT1, DT2)
[1] TRUE

When you only want to include future values within a certain boundary, you can use:

DT1 <- DT_LU[DT, 
             {
               idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
               .(tid  = i.t_id,
                 tTime = i.thisTime,
                 fTime = thisTime[idx],
                 fid = f_id[idx])
               },
             by=.EACHI]

which gives:

> DT1
   airport tid tTime fTime fid
1:       a   1   5.1     6   1
2:       a   1   5.1     6   2
3:       a   4   5.1     6   1
4:       a   4   5.1     6   2
5:       a   3   5.1     6   1
6:       a   3   5.1     6   2
7:       d   2   6.2     7   2
8:       d   5   6.2     7   2

When you compare that to the previous result, you see that now the rows 3, 6, 9, 10 and 12 have been removed.


Data:

DT <- data.table(t_id = c(1,4,2,3,5),
                 airport = c("a","a","d","a","d"),
                 thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
                 key=c("airport","thisTime"))

DT_LU <- data.table(f_id = c(rep(1,4),rep(2,3)),
                    airport = c("a","b","c","d","a","d","e"),
                    thisTime = c(6,7,8,9,6,7,8),
                    key=c("airport","thisTime"))
like image 103
Jaap Avatar answered Nov 17 '22 11:11

Jaap