Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Count the number of unique dates per ID between 2 dates, conditionally

Tags:

date

dataframe

r

I have a main table that contains the dates of main events for each personid:

dfMain <- data.frame(last    = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
                 previous    = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
                 personid    = c(12341, 122345, 12341, 12341, 122345),
                 diff        = c(NA, NA, 4, 28, 32))

(The NAs on the "previous" and "diff" variables indicate that this personid had his first "main even" i.e: no previous dates and no time difference)

I also have a secondary table that consists of a "secondary event" for each personid:

dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
                      personid = c(122345, 122345, 12341, 122345, 12341))

My question is, What is the optimal way (due to the volume of my data) to augment my "dfMain" data frame with the number of unique secondary events between main event dates for each personid.

In the dummy example my objective is to get this table:

Occurances  <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)
like image 324
Vasilis Vasileiou Avatar asked Dec 11 '22 10:12

Vasilis Vasileiou


1 Answers

Using the data.table-package:

# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]

# create a reference
dfSecondary <- dfSecondary[dfMain
                           , on = .(personid, date > previous, date < last)
                           , .(dates = x.date)
                           , by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))

# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
                                        , by = .(personid, previous, last)]
       , on = .(personid, previous, last)
       , Occ := V1][]

which gives:

         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3
like image 97
Jaap Avatar answered Dec 28 '22 22:12

Jaap