Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Counting previous rows in a data table based on date

Tags:

r

data.table

zoo

(Apologies if some of the terminology here is off - I come from a SQL background and I am only just getting into the R world)

I have a data table with a series of date-ordered entries. One of the fields in the data table is a grouping value, and one is a time value. With the data ordered (or keyed - I'm new to R and still not sure of the difference) by the group THEN the date, I want to count, for each row HOW MANY rows in this group precede the current row (including the current), within a given timespan.

Here's a simplified example of what I'm trying to do, using the Loblolly data set:

Prepping the example data:

library(lubridate)
library(zoo)
library(data.table)
DT = as.data.table(Loblolly)
DT[,rd := Sys.time() + years(age)]
setkey(DT,Seed,rd)

Now we have a data table ordered by Seed (the group) and rd (my date column). I have a solution which will produce my count value (ct) based on an interval of 10 years:

DT[,.ct:=mapply(function(x,y) DT[(rd>x-years(10) & rd<=x &Seed==y),.N],DT$rd,DT$Seed)]

This produces the desired result in this example dataset:

    height age Seed                  rd  ct
 1:   3.93   3  329 2019-03-01 13:38:00   1
 2:   9.34   5  329 2021-03-01 13:38:00   2
 3:  26.08  10  329 2026-03-01 13:38:00   3
 4:  37.79  15  329 2031-03-01 13:38:00   2
 5:  48.31  20  329 2036-03-01 13:38:00   2
 6:  56.43  25  329 2041-03-01 13:38:00   2
 7:   4.12   3  327 2019-03-01 13:38:00   1
 8:   9.92   5  327 2021-03-01 13:38:00   2
 9:  26.54  10  327 2026-03-01 13:38:00   3
10:  37.82  15  327 2031-03-01 13:38:00   2
...
...

However, I need to scale this up to work on upwards of 5 million records, across approx 10,000 groups, and it takes an unfeasibly long time to run there. Is there a quicker and less clumsy method to do what I'm trying to do?

like image 945
TimBrown Avatar asked Mar 01 '16 14:03

TimBrown


1 Answers

Here's a possible solution using data.table::foverlaps. The idea here is to first join over the whole range of {Sys.time() - years(10), Sys.time() + years(age)}. Then, count only the instances when the difference is less <= 10 years.

DT <- as.data.table(Loblolly)
DT[, c("rd", "rd2") := Sys.time() + years(age)] # create identical columns so foverlaps will work
setkey(DT, Seed, rd, rd2) # key by all for same reason
DT2 <- DT[, .(Seed, rd = rd - years(10), rd2, indx = .I)] # create minum range, create index to store row number
DT[, ct := foverlaps(DT, DT2)[i.rd > rd, .N, by = indx]$N] # run foverlaps, subset by condition and count
head(DT, 10)
#     height age Seed                  rd                 rd2 ct
#  1:   3.93   3  329 2019-03-01 22:59:02 2019-03-01 22:59:02  1
#  2:   9.34   5  329 2021-03-01 22:59:02 2021-03-01 22:59:02  2
#  3:  26.08  10  329 2026-03-01 22:59:02 2026-03-01 22:59:02  3
#  4:  37.79  15  329 2031-03-01 22:59:02 2031-03-01 22:59:02  2
#  5:  48.31  20  329 2036-03-01 22:59:02 2036-03-01 22:59:02  2
#  6:  56.43  25  329 2041-03-01 22:59:02 2041-03-01 22:59:02  2
#  7:   4.12   3  327 2019-03-01 22:59:02 2019-03-01 22:59:02  1
#  8:   9.92   5  327 2021-03-01 22:59:02 2021-03-01 22:59:02  2
#  9:  26.54  10  327 2026-03-01 22:59:02 2026-03-01 22:59:02  3
# 10:  37.82  15  327 2031-03-01 22:59:02 2031-03-01 22:59:02  2

EDIT 17/3/2017:

Using data.table v1.10.4+ you can now use non-uqui joins combined with by = .EACHI. Which basically allows you to both join using >= and <= rather than just exact join and also run computations while joining (in order to avoid Cartesian joins like in your case) and return just the final result. So in your specific case you can just do

DT[, rd10 := rd - years(10)]
DT[, ct := DT[DT, .N, on = .(Seed, rd <= rd, rd > rd10), by = .EACHI]$N]
like image 158
David Arenburg Avatar answered Sep 30 '22 12:09

David Arenburg