Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to do a data.table rolling join?

Tags:

r

data.table

I have two data tables that I'm trying to merge. One is data on company market values through time and the other is company dividend history through time. I'm trying to find out how much each company has paid each quarter and put that value next to the market value data through time.

library(magrittr)
library(data.table)
library(zoo)
library(lubridate)

set.seed(1337)
# data table of company  market values
companies <- 
    data.table(companyID = 1:10,
               Sedol = rep(c("91772E", "7A662B"), each = 5),
               Date = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
               MktCap = c(100 + cumsum(rnorm(5,5)),
                          50 + cumsum(rnorm(5,1,5)))) %>%
    setkey(Sedol, Date)

# data table of dividends
dividends <- 
    data.table(DivID = 1:7,
               Sedol = c(rep('91772E', each = 4), rep('7A662B', each = 3)),
               Date = as.Date(c('2004-11-19', '2005-01-13', '2005-01-29',
                                '2005-10-01', '2005-06-29', '2005-06-30',
                                '2006-04-17')),
               DivAmnt = rnorm(7, .8, .3)) %>%
    setkey(Sedol, Date)

I believe this is a situation where you could use a data.table rolling join, something like:

dividends[companies, roll = "nearest"]

to try and get a dataset that looks like

       DivID  Sedol       Date   DivAmnt companyID    MktCap
    1:    NA 7A662B       <NA>        NA         6  61.21061
    2:     5 7A662B 2005-06-29 0.7772631         7  66.92951
    3:     6 7A662B 2005-06-30 1.1815343         7  66.92951
    4:    NA 7A662B       <NA>        NA         8  78.33914
    5:    NA 7A662B       <NA>        NA         9  88.92473
    6:    NA 7A662B       <NA>        NA        10  87.85067
    7:     2 91772E 2005-01-13 0.2964291         1 105.19249
    8:     3 91772E 2005-01-29 0.8472649         1 105.19249
    9:    NA 91772E       <NA>        NA         2 108.74579
   10:     4 91772E 2005-10-01 1.2467408         3 113.42261
   11:    NA 91772E       <NA>        NA         4 120.04491
   12:    NA 91772E       <NA>        NA         5 124.35588

(note that I've matched the dividends to the company market values by the exact quarter)

But I'm not exactly sure how to execute it. The CRAN pdf is rather vague about what the number is or should be if roll is a value (Can you pass dates? Does a number quantify the days forward to carry? the number of obersvations?) and changing rollends around doesn't seem to get me what I want.

In the end, I ended up mapping the dividend dates to their quarter end and then joining on that. A good solution, but not useful if I end up needing to know how to perform rolling joins. In your answer, could you describe a situation where rolling joins are the only solution as well as help me understand how to perform them?

like image 916
jks612 Avatar asked Jan 27 '16 19:01

jks612


1 Answers

Instead of a rolling join, you may want to use an overlap join with the foverlaps function of data.table:

# create an interval in the 'companies' datatable
companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
# create a second date in the 'dividends' datatable
dividends[, Date2 := divDate]

# set the keys for the two datatable
setkey(companies, Sedol, start, end)
setkey(dividends, Sedol, divDate, Date2)

# create a vector of columnnames which can be removed afterwards
deletecols <- c("Date2","start","end")

# perform the overlap join and remove the helper columns
res <- foverlaps(companies, dividends)[, (deletecols) := NULL]

the result:

> res
     Sedol DivID    divDate   DivAmnt companyID   compDate    MktCap
 1: 7A662B    NA       <NA>        NA         6 2005-03-31  61.21061
 2: 7A662B     5 2005-06-29 0.7772631         7 2005-06-30  66.92951
 3: 7A662B     6 2005-06-30 1.1815343         7 2005-06-30  66.92951
 4: 7A662B    NA       <NA>        NA         8 2005-09-30  78.33914
 5: 7A662B    NA       <NA>        NA         9 2005-12-31  88.92473
 6: 7A662B    NA       <NA>        NA        10 2006-03-31  87.85067
 7: 91772E     2 2005-01-13 0.2964291         1 2005-03-31 105.19249
 8: 91772E     3 2005-01-29 0.8472649         1 2005-03-31 105.19249
 9: 91772E    NA       <NA>        NA         2 2005-06-30 108.74579
10: 91772E     4 2005-10-01 1.2467408         3 2005-09-30 113.42261
11: 91772E    NA       <NA>        NA         4 2005-12-31 120.04491
12: 91772E    NA       <NA>        NA         5 2006-03-31 124.35588

In the meantime the data.table authors have introduced non-equi joins (v1.9.8). You can also use that to solve this problem. Using a non-equi join you just need:

companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
dividends[companies, on = .(Sedol, divDate >= start, divDate <= end)]

to get the intended result.


Used data (the same as in the question, but without the creation of the keys):

set.seed(1337)
companies <- data.table(companyID = 1:10, Sedol = rep(c("91772E", "7A662B"), each = 5),
                        compDate = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
                        MktCap = c(100 + cumsum(rnorm(5,5)), 50 + cumsum(rnorm(5,1,5))))
dividends <- data.table(DivID = 1:7, Sedol = c(rep('91772E', each = 4), rep('7A662B', each = 3)),
                        divDate = as.Date(c('2004-11-19','2005-01-13','2005-01-29','2005-10-01','2005-06-29','2005-06-30','2006-04-17')),
                        DivAmnt = rnorm(7, .8, .3))
like image 89
Jaap Avatar answered Oct 03 '22 21:10

Jaap