I have a data set where I want to perform rolling functions over a date range.
I have currently done this via a loop, where I subset the main data for each range, then do the calculations and piece it together. It works well but the actual data set is over 2M rows and it took over 15 minutes to complete.
library(data.table)
library(lubridate)  # provides fast_strptime(), used below
start_date <- as.Date(fast_strptime("2021-03-04", "%Y-%m-%d"))
end_date <- start_date + 2
output <- data.table(NULL)
d = structure(list(date = structure(c(18690, 18690, 18692, 18692, 18692, 18693, 18693, 18694, 18695, 18695, 18695), class = "Date"),
id = c(1, 2, 1, 1, 2, 3, 1, 4, 4, 2, 1),
w = c(3, 1, 1, 1, 4, 2, 1, 2, 3, 4, 1)),
row.names = c(NA, -11L), class = c("data.table", "data.frame"))
while (end_date < Sys.Date()) {
x <- d[date >= start_date & date <= end_date, .(tw = sum(w)),
by = .(id)]
setorder(x, -tw, id)
x[, wprop := {x = sum(tw); y = cumsum(tw) / x}]
x[, idprop := {x = uniqueN(id); y = 1:.N / x}]
start_date <- end_date + 1
end_date <- start_date + 2
x[, start_date := start_date]
x[, end_date := end_date]
output <- rbindlist(list(output, x))
}
I would prefer a data.table solution, since I will be doing this for a few different time windows and need it to be as fast as possible.
First, iteratively building (growing) a frame or data.table is a known performance problem: each pass through the loop copies all rows accumulated so far. My first thought would be to wrap the per-window work in lapply instead of a while loop, generating a list of data.tables, and then do a single rbindlist at the end. This alone should provide a significant speed improvement on your 2M-row data.
start_dates <- seq(as.Date(fast_strptime("2021-03-04", "%Y-%m-%d")),
max(d$date) + 1, by = "3 days")
list_of_tables <- lapply(start_dates, function(start_date) {
x <- d[date >= start_date & date <= start_date + 2L, .(tw = sum(w)), by = .(id)]
setorder(x, -tw, id)
  x[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id))]
  # label with the next window's dates, as the OP's loop does; both are assigned
  # in one `:=` so that start_date on the RHS still refers to the function
  # argument rather than the just-created column (two separate assignments
  # would make end_date come out as start_date + 8L)
  x[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]
x
})
rbindlist(list_of_tables)
# id tw wprop idprop start_date end_date
# <num> <num> <num> <num> <Date> <Date>
# 1: 1 5 0.5000000 0.50 2021-03-07 2021-03-09
# 2: 2 5 1.0000000 1.00 2021-03-07 2021-03-09
# 3: 4 5 0.3846154 0.25 2021-03-10 2021-03-12
# 4: 2 4 0.6923077 0.50 2021-03-10 2021-03-12
# 5: 1 2 0.8461538 0.75 2021-03-10 2021-03-12
# 6: 3 2 1.0000000 1.00 2021-03-10 2021-03-12
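A small aside on this approach: windows that match no rows (here, the one starting 2021-03-10, which is past the last date in d) simply produce zero-row data.tables, which rbindlist drops silently, so no explicit filtering is needed. A quick check, assuming the list_of_tables object from above:

# the third window (2021-03-10 .. 2021-03-12) matches no rows in d,
# so its element is an empty data.table that rbindlist ignores
stopifnot(nrow(list_of_tables[[3]]) == 0L)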
Another method would be to use data.table's range (non-equi) joins, which handles all windows in a single operation and is a little more direct.
ranges <- data.table(start_date = seq(as.Date(fast_strptime("2021-03-04", "%Y-%m-%d")),
max(d$date)+1, by="3 days"))
ranges[, end_date := start_date + 2L]
ranges
# start_date end_date
# <Date> <Date>
# 1: 2021-03-04 2021-03-06
# 2: 2021-03-07 2021-03-09
# 3: 2021-03-10 2021-03-12
From here,
tmp <- d[ranges, on = .(date >= start_date, date <= end_date)  # range (non-equi) join
         ][, .(tw = sum(w)), by = .(date, id)                  # 'date' now holds each window's start_date
         ][!is.na(id), ]                                       # drop windows that matched no data
setorder(tmp, -tw, id)
tmp[, c("wprop", "idprop") := .(cumsum(tw)/sum(tw), seq_len(.N)/uniqueN(id)), by = .(date)
][, c("start_date", "end_date") := .(date + 3L, date + 5L)
][, date := NULL]
# id tw wprop idprop start_date end_date
# <num> <num> <num> <num> <Date> <Date>
# 1: 1 5 0.5000000 0.50 2021-03-07 2021-03-09
# 2: 2 5 1.0000000 1.00 2021-03-07 2021-03-09
# 3: 4 5 0.3846154 0.25 2021-03-10 2021-03-12
# 4: 2 4 0.6923077 0.50 2021-03-10 2021-03-12
# 5: 1 2 0.8461538 0.75 2021-03-10 2021-03-12
# 6: 3 2 1.0000000 1.00 2021-03-10 2021-03-12
Notes:
- Because the sequence of start dates deliberately runs past the data (max(d$date) + 1), there might be ranges with zero data, resulting in NA in the other fields. This is a natural side-effect of the join and this sample data; I don't know if it'll be an issue for you, but the [!is.na(id),] is meant to remove those rows. I think an inner join would feel more natural here, and adding nomatch = NULL to the join does exactly that, dropping the unmatched rows instead of filtering them afterwards.
- I combined the two calcs (wprop and idprop) into a single [-line for no real reason; they can be separate calcs.
- A side-effect of non-equi joins is that the join columns take their values from the i table (ranges), so what we would like to reference as start_date/end_date are known as date/date.1 immediately post-join. Fortunately we lose no data and don't strictly need date.1 this time, so it's a graceful recovery. (I could have renamed date to start_date instead of assigning new columns; that might be more efficient for 2M rows, so feel free to see if it helps you. There is a sketch of this variant after these notes.)
- FYI: with this data, benchmarking suggests that the range-join method runs in less than half the time of the list-of-frames method above. I doubt that ratio will hold when the data size increases significantly, so please come back after trying both methods on your 2M-row data and report the run-times (a benchmark sketch follows as well). Thanks!
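The renaming variant from the third note could look like the following. This is a minimal sketch, assuming the same d and ranges as above; setnames() renames the join column by reference (no copy), and nomatch = NULL takes the place of the !is.na(id) filter:

tmp <- d[ranges, on = .(date >= start_date, date <= end_date), nomatch = NULL
         ][, .(tw = sum(w)), by = .(date, id)]
setnames(tmp, "date", "start_date")  # rename in place instead of assigning a new column
setorder(tmp, -tw, id)
tmp[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id)), by = .(start_date)]
# both labels computed from the pre-shift start_date, as in the methods above
tmp[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]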
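As for the timing claim, here is one way to compare the two methods yourself. This is a sketch only: microbenchmark is one of several timing packages, and method_lapply/method_join are just illustrative wrappers around the code blocks above:

library(microbenchmark)

method_lapply <- function() {
  rbindlist(lapply(start_dates, function(start_date) {
    x <- d[date >= start_date & date <= start_date + 2L, .(tw = sum(w)), by = .(id)]
    setorder(x, -tw, id)
    x[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id))]
    x[, c("start_date", "end_date") := .(start_date + 3L, start_date + 5L)]
    x
  }))
}

method_join <- function() {
  tmp <- d[ranges, on = .(date >= start_date, date <= end_date)
           ][, .(tw = sum(w)), by = .(date, id)][!is.na(id)]
  setorder(tmp, -tw, id)
  tmp[, c("wprop", "idprop") := .(cumsum(tw) / sum(tw), seq_len(.N) / uniqueN(id)), by = .(date)
      ][, c("start_date", "end_date") := .(date + 3L, date + 5L)][, date := NULL]
  tmp
}

# times = 50 keeps the comparison quick; raise it for more stable medians
microbenchmark(lapply = method_lapply(), join = method_join(), times = 50)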