Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rolling sum with inequality conditionals

Tags:

r

sum

I'm working with some historical financial data from multiple companies.

My data.frames/tables have the following form:

library(data.table)

dt <- data.table(company = rep(1:2, each = 9),
                 period = c(1, 2, 2, 2, 3, 3, 4, 5, 6), # recycled
                 result = c(1, 2, 3, 4, 5, 6, 7, 8, 9), # recycled
                 filingDate = as.Date(c('2000-01-01', # 1 1
                                        '2000-02-01', # 2 2
                                        '2000-02-02', # 2 3
                                        '2000-03-02', # 2 4
                                        '2000-03-01', # 3 5
                                        '2000-03-02', # 3 6
                                        '2000-04-01', # 4 7
                                        '2000-05-01', # 5 8
                                        '2000-06-01', # 6 9
                                        # company 2
                                        '2000-03-01', # 1 1
                                        '2000-04-01', # 2 2
                                        '2000-04-02', # 2 3
                                        '2000-05-02', # 2 4
                                        '2000-05-01', # 3 5
                                        '2000-06-02', # 3 6
                                        '2000-06-01', # 4 7
                                        '2000-07-01', # 5 8
                                        '2000-08-01'))) # 6 9
dt
#>     company period result filingDate
#>  1:       1      1      1 2000-01-01
#>  2:       1      2      2 2000-02-01
#>  3:       1      2      3 2000-02-02
#>  4:       1      2      4 2000-03-02
#>  5:       1      3      5 2000-03-01
#>  6:       1      3      6 2000-03-02
#>  7:       1      4      7 2000-04-01
#>  8:       1      5      8 2000-05-01
#>  9:       1      6      9 2000-06-01
#> 10:       2      1      1 2000-03-01
#> 11:       2      2      2 2000-04-01
#> 12:       2      2      3 2000-04-02
#> 13:       2      2      4 2000-05-02
#> 14:       2      3      5 2000-05-01
#> 15:       2      3      6 2000-06-02
#> 16:       2      4      7 2000-06-01
#> 17:       2      5      8 2000-07-01
#> 18:       2      6      9 2000-08-01

Created on 2019-12-26 by the reprex package (v0.3.0)

So I here have results reported by two different companies over 6 periods (i.e. quarters). However, companies sometimes notice they reported something wrong, in which case they make restatements. In this example, the companies had to make two restatements (for a total of three reports) for Period 2, and a restatement for Period 3.

The date at which each of these reports/restatements was handed in is shown in filingDate.

Now, I want to calculate the known rolling sum of two (or more) periods for each company over time.

If we only had one filing per period, that could be trivially achieved with dt[, rSum = frollsum(result, 2), by = company].

However, since we can have multiple filings per period, we need to filter these down to the most recent filing known at the "current date", since we can have periods restated after following periods have already been filed. An example is company 1's Period 2 restatement on 2000-03-02, after Period 3's filing on 2000-03-01.

So, the expected output would be:

#>     company period rSum filingDate
#>  1:       1      1    1 2000-01-01
#>  2:       1      2    3 2000-02-01  # 2 + 1
#>  3:       1      2    4 2000-02-02  # 3 + 1
#>  4:       1      2    5 2000-03-02  # 4 + 1 (optional row, see below)
#>  5:       1      3    8 2000-03-01  # 5 + 3
#>  6:       1      3   10 2000-03-02  # 6 + 4
#>  7:       1      4   13 2000-04-01  # 7 + 6
#>  8:       1      5   15 2000-05-01  # 8 + 7
#>  9:       1      6   17 2000-06-01  # 9 + 8
#> 10:       2      1    1 2000-03-01 
#> 11:       2      2    3 2000-04-01  # 2 + 1
#> 12:       2      2    4 2000-04-02  # 3 + 1
#> 13:       2      3    8 2000-05-01  # 5 + 3
#> 14:       2      3    9 2000-05-02  # 5 + 4
#> 15:       2      4   12 2000-06-01  # 7 + 5
#> 16:       2      4   13 2000-06-02  # 7 + 6
#> 17:       2      5   15 2000-07-01  # 8 + 7
#> 18:       2      6   17 2000-08-01  # 9 + 8 

There are a few nuances here which should be made explicit.

Looking at rows 4 and 6 of the input table dt, we see that Company 1 made two restatements on 2000-03-02, for Periods 2 and 3. Since we have results for Period 3, we are no longer in Period 2, so the relevant "most recent" rSum is Period 3's. That is why the expected output above indicates row 4 as optional: I really only care about the rSum on row 6, but any solution which includes that dummy result on row 4 is perfectly fine as well.

What's more complicated is what happens in row 15 of the input and 16 of the expected output: Company 2 makes a restatement of its Period 3 results on 2000-06-02. However, by that time we already have results for Period 4 (given on 2000-06-01). So, once again, the "most recent" rSum on that date is actually between Periods 3 and 4 (6 + 7 = 13), and should be associated with the more recent Period 4, not 3.

I can't see how to do this elegantly. All I can think of is getting all the unique filing dates and doing a for-loop through them, so as to get the most-recent results from previous periods. And note this doesn't even satisfy the second nuance above, since the result on 2000-06-02 is associated with Period 3.

r <- c()
for(comp in unique(dt$company)) {
  x <- dt[company == comp]
  for(d in x[, sort(unique(filingDate))]) {
    y <- tail(x[filingDate <= d,
                .SD[filingDate == max(filingDate)],
                by = .(company, period)], 2)
    r <- c(r, sum(y$result))
  }
}

x <- dt[, .SD[period == max(period)],
        by = .(company, filingDate)]
setorder(x, company, filingDate, period)
x[, rSum := r]
setorder(x, company, period, filingDate)
x
#>     company filingDate period result rSum
#>  1:       1 2000-01-01      1      1    1
#>  2:       1 2000-02-01      2      2    3
#>  3:       1 2000-02-02      2      3    4
#>  4:       1 2000-03-01      3      5    8
#>  5:       1 2000-03-02      3      6   10
#>  6:       1 2000-04-01      4      7   13
#>  7:       1 2000-05-01      5      8   15
#>  8:       1 2000-06-01      6      9   17
#>  9:       2 2000-03-01      1      1    1
#> 10:       2 2000-04-01      2      2    3
#> 11:       2 2000-04-02      2      3    4
#> 12:       2 2000-05-02      2      4    9
#> 13:       2 2000-05-01      3      5    8
#> 14:       2 2000-06-02      3      6   13  # should be Period 4
#> 15:       2 2000-06-01      4      7   12
#> 16:       2 2000-07-01      5      8   15
#> 17:       2 2000-08-01      6      9   17

This feels very un-R, though, so is there a more idiomatic way of doing this?

like image 290
Wasabi Avatar asked Nov 06 '22 11:11

Wasabi


1 Answers

An option is to use a non-equi join:

winsz <- 2L
dt[, c("start", "end") := .(period - winsz + 1L, period)] 

dt[, rSum :=
    dt[dt, on=.(company, end>=start, end<=end),
        by=.EACHI, i.result +
            #filter right table for rows with periods before current period and filingDate before current filingDate 
            #and select the latest result for each period and then sum those result
            sum(.SD[period<i.period & filingDate<=i.filingDate, result[.N], period]$V1)
    ]$V1
]

output:

    company period result filingDate start end rSum
 1:       1      1      1 2000-01-01     0   1    1
 2:       1      2      2 2000-02-01     1   2    3
 3:       1      2      3 2000-02-02     1   2    4
 4:       1      2      4 2000-03-02     1   2    5
 5:       1      3      5 2000-03-01     2   3    8
 6:       1      3      6 2000-03-02     2   3   10
 7:       1      4      7 2000-04-01     3   4   13
 8:       1      5      8 2000-05-01     4   5   15
 9:       1      6      9 2000-06-01     5   6   17
10:       2      1      1 2000-03-01     0   1    1
11:       2      2      2 2000-04-01     1   2    3
12:       2      2      3 2000-04-02     1   2    4
13:       2      2      4 2000-05-02     1   2    5
14:       2      3      5 2000-05-01     2   3    8
15:       2      3      6 2000-06-02     2   3   10
16:       2      4      7 2000-06-01     3   4   12
17:       2      5      8 2000-07-01     4   5   15
18:       2      6      9 2000-08-01     5   6   17

data:

library(data.table)
dt <- data.table(company = rep(1:2, each = 9),
    period = c(1, 2, 2, 2, 3, 3, 4, 5, 6),
    result = c(1, 2, 3, 4, 5, 6, 7, 8, 9),
    filingDate = as.Date(c('2000-01-01','2000-02-01','2000-02-02','2000-03-02','2000-03-01','2000-03-02','2000-04-01','2000-05-01','2000-06-01','2000-03-01','2000-04-01','2000-04-02','2000-05-02','2000-05-01','2000-06-02','2000-06-01','2000-07-01','2000-08-01')))
dt[, filingDate := as.IDate(filingDate, format="%Y-%m-%d")]
like image 169
chinsoon12 Avatar answered Nov 12 '22 16:11

chinsoon12