I have a data set where I want to perform rolling functions over a date range.
I have currently done this via a loop, where I subset the main data for each range, then do the calculations and piece it together. It works well but the actual data set is over 2M rows and it took over 15 minutes to complete.
library(data.table)
library(lubridate)  # provides fast_strptime(), used below
start_date <- as.Date(fast_strptime("2021-03-04", "%Y-%m-%d"))
end_date <- start_date + 2
output <- data.table(NULL)
d = structure(list(date = structure(c(18690, 18690, 18692, 18692, 18692, 18693, 18693, 18694, 18695, 18695, 18695), class = "Date"),
id = c(1, 2, 1, 1, 2, 3, 1, 4, 4, 2, 1),
w = c(3, 1, 1, 1, 4, 2, 1, 2, 3, 4, 1)),
row.names = c(NA, -11L), class = c("data.table", "data.frame"))
while (end_date < Sys.Date()) {
x <- d[date >= start_date & date <= end_date, .(tw = sum(w)),
by = .(id)]
setorder(x, -tw, id)
x[, wprop := {x = sum(tw); y = cumsum(tw) / x}]
x[, idprop := {x = uniqueN(id); y = 1:.N / x}]
start_date <- end_date + 1
end_date <- start_date + 2
x[, start_date := start_date]
x[, end_date := end_date]
output <- rbindlist(list(output, x))
}
I would prefer a data.table solution, since I will be doing this for a few different time windows and need it to be as fast as possible.
First, iteratively building (growing) a frame or data.table is a known performance problem: each pass through the loop copies all rows accumulated so far. My first thought would be to wrap the per-window work in lapply instead of a while loop, generating a list of data.tables, and then do a single rbindlist at the end. This alone should provide a significant speed improvement on your 2M-row data.
start_dates <- seq(as.Date(fast_strptime("2021-03-04", "%Y-%m-%d")),
max(d$date) + 1, by = "3 days")
list_of_tables <- lapply(start_dates, function(start_date) {
x <- d[date >= start_date & date <= start_date + 2L, .(tw = sum(w)), by = .(id)]
setorder(x, -tw, id)
  x[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id))]
  # label with the next window's dates, as the OP's loop does; both are assigned
  # in one `:=` so that start_date on the RHS still refers to the function
  # argument rather than the just-created column (two separate assignments
  # would make end_date come out as start_date + 8L)
  x[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]
x
})
rbindlist(list_of_tables)
# id tw wprop idprop start_date end_date
# <num> <num> <num> <num> <Date> <Date>
# 1: 1 5 0.5000000 0.50 2021-03-07 2021-03-09
# 2: 2 5 1.0000000 1.00 2021-03-07 2021-03-09
# 3: 4 5 0.3846154 0.25 2021-03-10 2021-03-12
# 4: 2 4 0.6923077 0.50 2021-03-10 2021-03-12
# 5: 1 2 0.8461538 0.75 2021-03-10 2021-03-12
# 6: 3 2 1.0000000 1.00 2021-03-10 2021-03-12
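A small aside on this approach: windows that match no rows (here, the one starting 2021-03-10, which is past the last date in d) simply produce zero-row data.tables, which rbindlist drops silently, so no explicit filtering is needed. A quick check, assuming the list_of_tables object from above:

# the third window (2021-03-10 .. 2021-03-12) matches no rows in d,
# so its element is an empty data.table that rbindlist ignores
stopifnot(nrow(list_of_tables[[3]]) == 0L)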
Another method would be to use data.table's range (non-equi) joins, which handles all windows in a single operation and is a little more direct.
ranges <- data.table(start_date = seq(as.Date(fast_strptime("2021-03-04", "%Y-%m-%d")),
max(d$date)+1, by="3 days"))
ranges[, end_date := start_date + 2L]
ranges
# start_date end_date
# <Date> <Date>
# 1: 2021-03-04 2021-03-06
# 2: 2021-03-07 2021-03-09
# 3: 2021-03-10 2021-03-12
From here,
tmp <- d[ranges, on = .(date >= start_date, date <= end_date)  # range (non-equi) join
         ][, .(tw = sum(w)), by = .(date, id)                  # 'date' now holds each window's start_date
         ][!is.na(id), ]                                       # drop windows that matched no data
setorder(tmp, -tw, id)
tmp[, c("wprop", "idprop") := .(cumsum(tw)/sum(tw), seq_len(.N)/uniqueN(id)), by = .(date)
][, c("start_date", "end_date") := .(date + 3L, date + 5L)
][, date := NULL]
# id tw wprop idprop start_date end_date
# <num> <num> <num> <num> <Date> <Date>
# 1: 1 5 0.5000000 0.50 2021-03-07 2021-03-09
# 2: 2 5 1.0000000 1.00 2021-03-07 2021-03-09
# 3: 4 5 0.3846154 0.25 2021-03-10 2021-03-12
# 4: 2 4 0.6923077 0.50 2021-03-10 2021-03-12
# 5: 1 2 0.8461538 0.75 2021-03-10 2021-03-12
# 6: 3 2 1.0000000 1.00 2021-03-10 2021-03-12
Notes:
- Because the sequence of start dates deliberately runs past the data (max(d$date) + 1), there might be ranges with zero data, resulting in NA in the other fields. This is a natural side-effect of the join and this sample data; I don't know if it'll be an issue for you, but the [!is.na(id),] is meant to remove those rows. I think an inner join would feel more natural here, and adding nomatch = NULL to the join does exactly that, dropping the unmatched rows instead of filtering them afterwards.
- I combined the two calcs (wprop and idprop) into a single [-line for no real reason; they can be separate calcs.
- A side-effect of non-equi joins is that the join columns take their values from the i table (ranges), so what we would like to reference as start_date/end_date are known as date/date.1 immediately post-join. Fortunately we lose no data and don't strictly need date.1 this time, so it's a graceful recovery. (I could have renamed date to start_date instead of assigning new columns; that might be more efficient for 2M rows, so feel free to see if it helps you. There is a sketch of this variant after these notes.)
- FYI: with this data, benchmarking suggests that the range-join method runs in less than half the time of the list-of-frames method above. I doubt that ratio will hold when the data size increases significantly, so please come back after trying both methods on your 2M-row data and report the run-times (a benchmark sketch follows as well). Thanks!
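The renaming variant from the third note could look like the following. This is a minimal sketch, assuming the same d and ranges as above; setnames() renames the join column by reference (no copy), and nomatch = NULL takes the place of the !is.na(id) filter:

tmp <- d[ranges, on = .(date >= start_date, date <= end_date), nomatch = NULL
         ][, .(tw = sum(w)), by = .(date, id)]
setnames(tmp, "date", "start_date")  # rename in place instead of assigning a new column
setorder(tmp, -tw, id)
tmp[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id)), by = .(start_date)]
# both labels computed from the pre-shift start_date, as in the methods above
tmp[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]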
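As for the timing claim, here is one way to compare the two methods yourself. This is a sketch only: microbenchmark is one of several timing packages, and method_lapply/method_join are just illustrative wrappers around the code blocks above:

library(microbenchmark)

method_lapply <- function() {
  rbindlist(lapply(start_dates, function(start_date) {
    x <- d[date >= start_date & date <= start_date + 2L, .(tw = sum(w)), by = .(id)]
    setorder(x, -tw, id)
    x[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id))]
    x[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]
    x
  }))
}

method_join <- function() {
  tmp <- d[ranges, on = .(date >= start_date, date <= end_date)
           ][, .(tw = sum(w)), by = .(date, id)][!is.na(id)]
  setorder(tmp, -tw, id)
  tmp[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id)), by = .(date)
      ][, c("start_date", "end_date") := .(date + 3L, date + 5L)][, date := NULL]
  tmp
}

# times = 50 keeps the comparison quick; raise it for more stable medians
microbenchmark(lapply = method_lapply(), join = method_join(), times = 50)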