I have a main table that contains the dates of main events for each personid:
dfMain <- data.frame(last = c("2017-08-01", "2017-08-01", "2017-08-05","2017-09-02","2017-09-02"),
previous = c(NA, NA, "2017-08-01", "2017-08-05", "2017-08-01"),
personid = c(12341, 122345, 12341, 12341, 122345),
diff = c(NA, NA, 4, 28, 32))
(The NAs in the "previous" and "diff" variables indicate that this personid had their first "main event", i.e. there is no previous date and no time difference.)
I also have a secondary table that contains the dates of "secondary events" for each personid:
dfSecondary <- data.frame(date = c("2017-09-01", "2017-08-30", "2017-08-04", "2017-08-02", "2017-08-02"),
personid = c(122345, 122345, 12341, 122345, 12341))
My question is: given the volume of my data, what is the most efficient way to augment "dfMain" with the number of unique secondary events that fall between the main event dates for each personid?
In the dummy example my objective is to get this table:
Occurances <- c(NA, NA, 2, 0, 3)
dfObjective <- data.frame(dfMain, Occurances)
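To make the required logic explicit, here is a deliberately naive base-R version (far too slow for my real data volumes, but it should reproduce the Occurances vector above; OccCheck is just a throwaway name):

# naive reference implementation: for each dfMain row, count the unique
# secondary dates strictly between 'previous' and 'last' for that personid
p     <- as.Date(dfMain$previous)
l     <- as.Date(dfMain$last)
sdate <- as.Date(dfSecondary$date)
OccCheck <- vapply(seq_len(nrow(dfMain)), function(i) {
  if (is.na(p[i])) return(NA_integer_)                    # first main event: no interval yet
  d <- sdate[dfSecondary$personid == dfMain$personid[i]]  # this person's secondary dates
  length(unique(d[d > p[i] & d < l[i]]))                  # unique dates strictly inside the interval
}, integer(1))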
Using the data.table-package:
# load 'data.table' package and convert date-columns to date-class
library(data.table)
setDT(dfMain)[, 1:2 := lapply(.SD, as.IDate), .SDcols = 1:2][]
setDT(dfSecondary)[, date := as.IDate(date)][]
# create a reference
dfSecondary <- dfSecondary[dfMain
, on = .(personid, date > previous, date < last)
, .(dates = x.date)
, by = .EACHI]
setnames(dfSecondary, 2:3, c('previous','last'))
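At this point dfSecondary is a long reference table with one row per interval and matching secondary date, and an NA row where no secondary event falls inside the interval. For the dummy data it should look roughly like:

   personid   previous       last      dates
      12341       <NA> 2017-08-01       <NA>
     122345       <NA> 2017-08-01       <NA>
      12341 2017-08-01 2017-08-05 2017-08-04
      12341 2017-08-01 2017-08-05 2017-08-02
      12341 2017-08-05 2017-09-02       <NA>
     122345 2017-08-01 2017-09-02 2017-09-01
     122345 2017-08-01 2017-09-02 2017-08-30
     122345 2017-08-01 2017-09-02 2017-08-02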
# join and summarise
dfMain[na.omit(dfSecondary, cols = 1:3)[, sum(!is.na(dates), na.rm = TRUE)
, by = .(personid, previous, last)]
, on = .(personid, previous, last)
, Occ := V1][]
which gives:
         last   previous personid diff Occ
1: 2017-08-01       <NA>    12341   NA  NA
2: 2017-08-01       <NA>   122345   NA  NA
3: 2017-08-05 2017-08-01    12341    4   2
4: 2017-09-02 2017-08-05    12341   28   0
5: 2017-09-02 2017-08-01   122345   32   3
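As a quick sanity check, the new Occ column should match the Occurances vector defined in the question:

# compare against the expected result from the question
all.equal(as.numeric(dfMain$Occ), Occurances)   # should be TRUE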