I have a large data frame (3M+ rows). I am trying to count the number of times a certain ActivityType appears in a 21-day window. I modelled my solution on Rolling Sum by Another Variable in R, but it takes a long time just for one ActivityType; I did not expect 3M+ rows to take an inordinate amount of time. Below is what I tried:
dt <- read.table(text='
Name ActivityType ActivityDate
John Email 1/1/2014
John Email 1/3/2014
John Webinar 1/5/2014
John Webinar 1/20/2014
John Webinar 3/25/2014
John Email 4/1/2014
John Email 4/20/2014
Tom Email 1/1/2014
Tom Webinar 1/5/2014
Tom Webinar 1/20/2014
Tom Webinar 3/25/2014
Tom Email 4/1/2014
Tom Email 4/20/2014
', header=T, row.names = NULL)
library(data.table)
library(reshape2)
dt$ActivityType <- factor(dt$ActivityType)
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
dt <- dt[order(dt$Name, dt$ActivityDate),]
dt <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)
setDT(dt)
# Build reference table: one row per Name holding that person's Email counts and dates
Ref <- dt[, list(Compare_Value = list(I(Email)),
                 Compare_Date  = list(I(ActivityDate))), by = "Name"]
# Use mapply to sum the last 21 days of values by Name
dt[, Email_RollingSum := mapply(ActivityDate = ActivityDate, Name = Name,
                                function(ActivityDate, Name) {
  d <- as.numeric(Ref$Compare_Date[[Name]] - ActivityDate)
  sum((d <= 0 & d >= -21) * Ref$Compare_Value[[Name]])
})]
And this is just for ActivityType = "Email"; I then have to repeat it for every other ActivityType level. The link I took the solution from mentioned using "mclapply" (presumably parallel::mclapply) rather than "mapply". Kindly let me know how I can use mclapply, or any other approach that will make this faster.
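In case it helps to show what I had in mind: a minimal sketch of how parallel::mclapply might fan the computation out over the ActivityType levels, one fork per level. This assumes "mcapply" meant mclapply; the types vector and the ref/idx helper names are mine, and mclapply forks, so it will not parallelize on Windows.
library(parallel)
# dt here is the post-dcast table from above (one column per ActivityType)
types <- c("Email", "Webinar")
res <- mclapply(types, function(tp) {
  # per-Name lookup of this type's counts and dates, like Ref above
  ref <- dt[, .(vals = list(.SD[[tp]]), dates = list(ActivityDate)), by = Name]
  idx <- match(dt$Name, ref$Name)   # row of ref for each row of dt
  mapply(function(date, i) {
    d <- as.numeric(ref$dates[[i]]) - as.numeric(date)
    sum((d <= 0 & d >= -21) * ref$vals[[i]])
  }, dt$ActivityDate, idx)
}, mc.cores = length(types))
for (i in seq_along(types))
  dt[, (paste0(types[i], "_RollingSum")) := res[[i]]]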
Below is the expected output. For each row, the time window is the row's ActivityDate together with the 21 days before it, and I count how many times ActivityType="Email" appears in that window (a small worked check follows the table).
Name ActivityType ActivityDate Email_RollingSum
John Email 1/1/2014 1
John Email 1/3/2014 2
John Webinar 1/5/2014 2
John Webinar 1/20/2014 2
John Webinar 3/25/2014 0
John Email 4/1/2014 1
John Email 4/20/2014 2
Tom Email 1/1/2014 1
Tom Webinar 1/5/2014 1
Tom Webinar 1/20/2014 1
Tom Webinar 3/25/2014 0
Tom Email 4/1/2014 1
Tom Email 4/20/2014 2
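For example, the last John row works out like this (a quick sanity check using nothing beyond the sample data above):
# John's 4/20/2014 row: the window runs from 3/30/2014 through 4/20/2014
emails <- as.Date(c("2014-01-01", "2014-01-03", "2014-04-01", "2014-04-20"))
day    <- as.Date("2014-04-20")
sum(emails >= day - 21 & emails <= day)
#[1] 2    (matches Email_RollingSum = 2 above)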
One way to speed this up is to drop the per-row mapply entirely: a rolling join finds, for each row, the earliest row that falls inside its 21-day window, and the activity types are then counted over each row range.
setDT(dt)
dt[, ActivityDate := as.Date(ActivityDate, '%m/%d/%Y')]
# add index to keep track of rows
dt[, idx := .I]
# match the dates we're looking for using a rolling join and extract the row numbers
rr = dt[.(Name = Name, ActivityDate = ActivityDate - 21, refIdx = idx),
.(idx, refIdx), on = c('Name', 'ActivityDate'), roll = -Inf]
# idx refIdx
# 1: 1 1
# 2: 1 2
# 3: 1 3
# 4: 1 4
# 5: 5 5
# 6: 5 6
# 7: 6 7
# 8: 8 8
# 9: 8 9
#10: 8 10
#11: 11 11
#12: 11 12
#13: 12 13
# extract the above rows and count occurrences using dcast
dcast(rr[, {seq = idx:refIdx; dt[seq]}, by = 1:nrow(rr)], nrow ~ ActivityType)
# nrow Email Webinar
#1 1 1 0
#2 2 2 0
#3 3 2 1
#4 4 2 2
#5 5 0 1
#6 6 1 1
#7 7 2 0
#8 8 1 0
#9 9 1 1
#10 10 1 2
#11 11 0 1
#12 12 1 1
#13 13 2 0
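The counts above are keyed by row number rather than attached to dt, so as a last step they can be bound back on (a sketch; the *_RollingSum column names are my own, and dcast's default length aggregate is exactly the count needed):
counts <- dcast(rr[, {seq = idx:refIdx; dt[seq]}, by = 1:nrow(rr)],
                nrow ~ ActivityType)
dt[counts$nrow, `:=`(Email_RollingSum   = counts$Email,
                     Webinar_RollingSum = counts$Webinar)]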
Try an approach in which the data.table is used both as the list of names and dates and as the source of the email counts. This is done in data.table by passing DT itself in the i argument of DT[...] together with by = .EACHI. Code could look like:
library(data.table)
# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
# convert to a 'data.table' and define key
setDT(dt, key = "Name")
# count emails and webinars
dt <- dt[dt[,.(Name, type = ActivityType, date = ActivityDate)],
.(type, date,
Email = sum(ActivityType == "Email" & between(ActivityDate, date-21, date)),
Webinar = sum(ActivityType == "Webinar" & between(ActivityDate, date-21, date))),
by=.EACHI]
The following uses the same approach as above but includes a few changes which may improve the speed by 30-40%, depending upon your data: ActivityType is converted to character so the faster %chin% operator can be used, and the between() result is computed once per group and reused for both sums.
setDT(dt, key = "Name")
dt[, ":="(ActivityDate = as.Date(ActivityDate, "%m/%d/%Y"),
          ActivityType = as.character(ActivityType))]
dt4 <- dt[.(Name = Name, type = ActivityType, date = ActivityDate),
          {z <- between(ActivityDate, date - 21, date)
           .(type, date,
             Email   = sum((ActivityType %chin% "Email") & z),
             Webinar = sum((ActivityType %chin% "Webinar") & z))},
          by = .EACHI]
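Since the question mentions more ActivityType levels than just Email and Webinar, the hard-coded sums could be generalized along these lines (an unbenchmarked sketch under the same setup; types and dt5 are names I introduce here):
types <- unique(dt$ActivityType)   # all levels present in the data
dt5 <- dt[.(Name = Name, type = ActivityType, date = ActivityDate),
          {z <- between(ActivityDate, date - 21, date)
           c(.(type = type, date = date),
             setNames(lapply(types, function(tp) sum((ActivityType %chin% tp) & z)),
                      types))},
          by = .EACHI]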