(Apologies if some of the terminology here is off - I come from a SQL background and I am only just getting into the R world)
I have a data table with a series of date-ordered entries. One of the fields in the data table is a grouping value, and one is a time value. With the data ordered (or keyed - I'm new to R and still not sure of the difference) by the group THEN the date, I want to count, for each row HOW MANY rows in this group precede the current row (including the current), within a given timespan.
Here's a simplified example of what I'm trying to do, using the Loblolly data set:
Prepping the example data:
library(lubridate)
library(zoo)
library(data.table)
DT = as.data.table(Loblolly)
DT[,rd := Sys.time() + years(age)]
setkey(DT,Seed,rd)
Now we have a data table ordered by Seed (the group) and rd (my date column). I have a solution which will produce my count value (ct) based on an interval of 10 years:
DT[,.ct:=mapply(function(x,y) DT[(rd>x-years(10) & rd<=x &Seed==y),.N],DT$rd,DT$Seed)]
This produces the desired result in this example dataset:
height age Seed rd ct
1: 3.93 3 329 2019-03-01 13:38:00 1
2: 9.34 5 329 2021-03-01 13:38:00 2
3: 26.08 10 329 2026-03-01 13:38:00 3
4: 37.79 15 329 2031-03-01 13:38:00 2
5: 48.31 20 329 2036-03-01 13:38:00 2
6: 56.43 25 329 2041-03-01 13:38:00 2
7: 4.12 3 327 2019-03-01 13:38:00 1
8: 9.92 5 327 2021-03-01 13:38:00 2
9: 26.54 10 327 2026-03-01 13:38:00 3
10: 37.82 15 327 2031-03-01 13:38:00 2
...
...
However, I need to scale this up to work on upwards of 5 million records, across approx 10,000 groups, and it takes an unfeasibly long time to run there. Is there a quicker and less clumsy method to do what I'm trying to do?
Here's a possible solution using data.table::foverlaps
. The idea here is to first join over the whole range of {Sys.time() - years(10), Sys.time() + years(age)}
. Then, count only the instances when the difference is less <= 10 years.
DT <- as.data.table(Loblolly)
DT[, c("rd", "rd2") := Sys.time() + years(age)] # create identical columns so foverlaps will work
setkey(DT, Seed, rd, rd2) # key by all for same reason
DT2 <- DT[, .(Seed, rd = rd - years(10), rd2, indx = .I)] # create minum range, create index to store row number
DT[, ct := foverlaps(DT, DT2)[i.rd > rd, .N, by = indx]$N] # run foverlaps, subset by condition and count
head(DT, 10)
# height age Seed rd rd2 ct
# 1: 3.93 3 329 2019-03-01 22:59:02 2019-03-01 22:59:02 1
# 2: 9.34 5 329 2021-03-01 22:59:02 2021-03-01 22:59:02 2
# 3: 26.08 10 329 2026-03-01 22:59:02 2026-03-01 22:59:02 3
# 4: 37.79 15 329 2031-03-01 22:59:02 2031-03-01 22:59:02 2
# 5: 48.31 20 329 2036-03-01 22:59:02 2036-03-01 22:59:02 2
# 6: 56.43 25 329 2041-03-01 22:59:02 2041-03-01 22:59:02 2
# 7: 4.12 3 327 2019-03-01 22:59:02 2019-03-01 22:59:02 1
# 8: 9.92 5 327 2021-03-01 22:59:02 2021-03-01 22:59:02 2
# 9: 26.54 10 327 2026-03-01 22:59:02 2026-03-01 22:59:02 3
# 10: 37.82 15 327 2031-03-01 22:59:02 2031-03-01 22:59:02 2
EDIT 17/3/2017:
Using data.table v1.10.4+ you can now use non-uqui joins combined with by = .EACHI
. Which basically allows you to both join using >=
and <=
rather than just exact join and also run computations while joining (in order to avoid Cartesian joins like in your case) and return just the final result. So in your specific case you can just do
DT[, rd10 := rd - years(10)]
DT[, ct := DT[DT, .N, on = .(Seed, rd <= rd, rd > rd10), by = .EACHI]$N]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With