Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

aggregate/merge over date range using data.table

Tags:

r

data.table

Suppose I have two data.tables:

summary <- data.table(period = c("A","B","C","D"),
                 from_date = ymd(c("2017-01-01", "2017-01-03", "2017-02-08", "2017-03-07")),
                 to_date = ymd(c("2017-01-31", "2017-04-01", "2017-03-08", "2017-05-01"))
)

log <- data.table(date = ymd(c("2017-01-03","2017-01-20","2017-02-01","2017-03-03",
                               "2017-03-15","2017-03-28","2017-04-03","2017-04-23")),
                  event1 = c(4,8,8,4,3,4,7,3), event2 = c(1,8,7,3,8,4,6,3))

which look like this:

> summary
   period  from_date    to_date
1:      A 2017-01-01 2017-01-31
2:      B 2017-01-03 2017-04-01
3:      C 2017-02-08 2017-03-08
4:      D 2017-03-07 2017-05-01
> log
         date event1 event2
1: 2017-01-03      4      1
2: 2017-01-20      8      8
3: 2017-02-01      8      7
4: 2017-03-03      4      3
5: 2017-03-15      3      8
6: 2017-03-28      4      4
7: 2017-04-03      7      6
8: 2017-04-23      3      3

I would like to get the sum of event1 and event2 for each time period in the table summary.

I know I can do this:

summary[, c("event1","event2") := .(sum(log[date>=from_date & date<=to_date, event1]),
                               sum(log[date>=from_date & date<=to_date, event2]))
   , by=period][]

to get the desired result:

   period  from_date    to_date event1 event2
1:      A 2017-01-01 2017-01-31     12      9
2:      B 2017-01-03 2017-04-01     31     31
3:      C 2017-02-08 2017-03-08      4      3
4:      D 2017-03-07 2017-05-01     17     21

Now, in my real-life problem, I have about 30 columns to be summed, which I may want to change later, and summary has ~35,000 rows, log has ~40,000,000 rows. Is there an efficient way to achieve this?

Note: This is my first post here. I hope my question is clear and specific enough, please do make suggestions if there is anything I should do to improve the question. Thanks!

like image 506
Glen Ng Avatar asked Nov 16 '17 10:11

Glen Ng


2 Answers

Yes, you can perform a non-equi join.

(Note I've changed log and summary to Log and Summary as the originals are already functions in R.)

Log[Summary,
   on = c("date>=from_date", "date<=to_date"),
   nomatch=0L, 
   allow.cartesian = TRUE][, .(from_date = date[1],
                               to_date = date.1[1],
                               event1 = sum(event1),
                               event2 = sum(event2)),
                           keyby = "period"]

To sum over a pattern of columns, use lapply with .SD:

joined_result <- 
  Log[Summary,
      on = c("date>=from_date", "date<=to_date"),
      nomatch = 0L, 
      allow.cartesian = TRUE]

cols <- grep("event[a-z]?[0-9]", names(joined_result), value = TRUE)

joined_result[, lapply(.SD, sum),
              .SDcols = cols,
              keyby = .(period,
                        from_date = date,
                        to_date = date.1)]
like image 184
Hugh Avatar answered Nov 17 '22 16:11

Hugh


With data.table, it is possible to aggregate during a non-equi join using by = .EACHI.

log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L, 
    lapply(.SD, sum), by = .EACHI]
         date       date event1 event2
1: 2017-01-01 2017-01-31     12      9
2: 2017-01-03 2017-04-01     31     31
3: 2017-02-08 2017-03-08      4      3
4: 2017-03-07 2017-05-01     17     21

With some additional clean-up:

log[summary, on = .(date >= from_date, date <= to_date), nomatch=0L, 
    c(period = period, lapply(.SD, sum)), by = .EACHI][
      , setnames(.SD, 1:2, c("from_date", "to_date"))]
    from_date    to_date period event1 event2
1: 2017-01-01 2017-01-31      A     12      9
2: 2017-01-03 2017-04-01      B     31     31
3: 2017-02-08 2017-03-08      C      4      3
4: 2017-03-07 2017-05-01      D     17     21
like image 4
Uwe Avatar answered Nov 17 '22 14:11

Uwe