
Fastest way for doing 21 day rolling sum for an ActivityType

I have a large data frame (3M+ rows). I am trying to count the number of times a certain ActivityType appears in a 21-day window. I modelled my solution on "Rolling Sum by Another Variable in R", but it takes a long time even for a single ActivityType. I did not expect 3M+ rows to take an inordinate amount of time. Below is what I tried:

dt <- read.table(text='

                         Name      ActivityType     ActivityDate                
                         John       Email            1/1/2014           
                         John       Email            1/3/2014                
                         John       Webinar          1/5/2014          
                         John       Webinar          1/20/2014          
                         John       Webinar          3/25/2014          
                         John       Email            4/1/2014           
                         John       Email            4/20/2014          
                         Tom        Email            1/1/2014           
                         Tom       Webinar           1/5/2014           
                         Tom       Webinar           1/20/2014          
                         Tom       Webinar           3/25/2014          
                         Tom       Email             4/1/2014           
                         Tom       Email             4/20/2014          

                         ', header=TRUE, row.names = NULL)

library(data.table)

dt$ActivityType <- factor(dt$ActivityType)
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y")
dt <- dt[order(dt$Name, dt$ActivityDate),]
setDT(dt)  # the data.table syntax below needs a data.table, not a data.frame

# One row per Name/ActivityDate, with a count column per ActivityType
dt <- dcast(dt, Name + ActivityDate ~ ActivityType, fun.aggregate=length)
# Build reference table: one list-column entry per Name
Ref <- dt[,list(Compare_Value=list(I(Email)),Compare_Date=list(I(ActivityDate))), by=c("Name")]
# Use mapply to get last 21 days of value by Name
dt[,Email_RollingSum := mapply(ActivityDate=ActivityDate,Name=Name, function(ActivityDate, Name) {
        k <- match(Name, Ref$Name)  # row of Ref that holds this Name
        d <- as.numeric(Ref$Compare_Date[[k]] - ActivityDate)
        sum((d <= 0 & d >= -21)*Ref$Compare_Value[[k]])})]

And this is just for ActivityType = Email; I then have to do the same for the other ActivityType levels. The answer I took this solution from suggested using mcmapply (from the parallel package) rather than mapply. Kindly let me know how I can use mcmapply, or any other solution that would make this faster.

Below is the expected output. For each row, the time window is the ActivityDate and the 21 days before it. I count how many times ActivityType = "Email" appears for the same Name within that window.

Name   ActivityType   ActivityDate   Email_RollingSum
John   Email          1/1/2014       1
John   Email          1/3/2014       2
John   Webinar        1/5/2014       2
John   Webinar        1/20/2014      2
John   Webinar        3/25/2014      0
John   Email          4/1/2014       1
John   Email          4/20/2014      2
Tom    Email          1/1/2014       1
Tom    Webinar        1/5/2014       1
Tom    Webinar        1/20/2014      1
Tom    Webinar        3/25/2014      0
Tom    Email          4/1/2014       1
Tom    Email          4/20/2014      2
gibbz00 asked Dec 24 '15 17:12


2 Answers

library(data.table)
setDT(dt)  # convert in place; the code below uses data.table syntax

dt[, ActivityDate := as.Date(ActivityDate, '%m/%d/%Y')]

# add index to keep track of rows
dt[, idx := .I]

# match the dates we're looking for using a rolling join and extract the row numbers
rr = dt[.(Name = Name, ActivityDate = ActivityDate - 21, refIdx = idx),
       .(idx, refIdx), on = c('Name', 'ActivityDate'), roll = -Inf]
#    idx refIdx
# 1:   1      1
# 2:   1      2
# 3:   1      3
# 4:   1      4
# 5:   5      5
# 6:   5      6
# 7:   6      7
# 8:   8      8
# 9:   8      9
#10:   8     10
#11:  11     11
#12:  11     12
#13:  12     13

# extract the above rows and count occurrences using dcast
dcast(rr[, {seq = idx:refIdx; dt[seq]}, by = 1:nrow(rr)], nrow ~ ActivityType)
#   nrow Email Webinar
#1     1     1       0
#2     2     2       0
#3     3     2       1
#4     4     2       2
#5     5     0       1
#6     6     1       1
#7     7     2       0
#8     8     1       0
#9     9     1       1
#10   10     1       2
#11   11     0       1
#12   12     1       1
#13   13     2       0
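
The rolling join above hinges on roll = -Inf, which for each lookup date returns the next observation on or after that date. A minimal sketch on toy data may make this easier to see; the table names ev and lookup and their dates are made up for illustration and are not part of the question's data.

```r
library(data.table)

# Hypothetical toy data: three dated events for one person
ev <- data.table(Name = "A",
                 ActivityDate = as.Date(c("2014-01-01", "2014-01-10", "2014-03-01")))
ev[, idx := .I]  # row numbers, as in the answer above

# Hypothetical lookup dates that fall between the events
lookup <- data.table(Name = "A",
                     ActivityDate = as.Date(c("2013-12-20", "2014-01-05", "2014-02-15")))

# roll = -Inf carries the next observation backwards, so each lookup date
# matches the first event on or after it
res <- ev[lookup, on = c("Name", "ActivityDate"), roll = -Inf]
res$idx
# 1 2 3
```

In the answer, the lookup date is ActivityDate - 21, so the matched row is the first row inside the 21-day window and idx:refIdx spans the whole window. This assumes the table is sorted by Name and ActivityDate, as it is in the question.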
eddi answered Nov 02 '22 02:11


Try an approach in which the same data.table serves both as the list of names and dates to look up and as the source of the counts. In data.table, this is done by passing the table itself as the i argument (a self-join on Name) together with by = .EACHI, so that j is evaluated once for each row of i. The code could look like:

# convert character dates to Date types
dt$ActivityDate <- as.Date(dt$ActivityDate, "%m/%d/%Y") 
# convert to a 'data.table' and define key
setDT(dt, key = "Name")
# count emails and webinars
dt <- dt[dt[,.(Name, type = ActivityType, date = ActivityDate)],
         .(type, date,
           Email = sum(ActivityType == "Email" & between(ActivityDate, date-21, date)),
           Webinar = sum(ActivityType == "Webinar" & between(ActivityDate, date-21, date))),
         by = .EACHI]
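
As a sanity check, the self-join with by = .EACHI can be run on the question's sample data and compared with the expected Email_RollingSum column. This is a self-contained sketch: the data are retyped from the question, and the result name res is just a placeholder.

```r
library(data.table)

# Sample data retyped from the question
dt <- data.table(
  Name = rep(c("John", "Tom"), times = c(7, 6)),
  ActivityType = c("Email", "Email", "Webinar", "Webinar", "Webinar", "Email", "Email",
                   "Email", "Webinar", "Webinar", "Webinar", "Email", "Email"),
  ActivityDate = as.Date(c("1/1/2014", "1/3/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014",
                           "1/1/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014"), "%m/%d/%Y")
)
setkey(dt, Name)

# For each row of i, j sees all rows of dt with the same Name and counts
# the emails whose date falls in [date - 21, date]
res <- dt[dt[, .(Name, type = ActivityType, date = ActivityDate)],
          .(type, date,
            Email = sum(ActivityType == "Email" &
                          between(ActivityDate, date - 21, date))),
          by = .EACHI]
res$Email
# 1 2 2 2 0 1 2 1 1 1 0 1 2
```

Note that every row is compared against all rows sharing its Name, so the cost grows with the square of the group size; whether that matters depends on how many rows each Name has.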

The following uses the same approach as above but includes a few changes that may improve the speed by 30-40%, depending on your data.

setDT(dt, key = "Name")
dt[, ":="(ActivityDate = as.Date(ActivityDate, "%m/%d/%Y"),
          ActivityType = as.character(ActivityType))]
dt4 <- dt[.(Name = Name, type = ActivityType, date = ActivityDate),
          {
            # compute the 21-day window mask once and reuse it for each type
            z <- between(ActivityDate, date - 21, date)
            .(type, date,
              Email   = sum((ActivityType %chin% "Email") & z),
              Webinar = sum((ActivityType %chin% "Webinar") & z))
          },
          by = .EACHI]
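
A different technique, not used in either answer, is a non-equi self-join, which needs data.table 1.9.8 or later. It puts the window bounds into the join condition, so only rows inside the 21-day window are matched. The sketch below is an assumption-laden alternative rather than a tested replacement: the column name start and the result name cnt are made up here, and it has not been benchmarked against the code above.

```r
library(data.table)

# Sample data retyped from the question
dt <- data.table(
  Name = rep(c("John", "Tom"), times = c(7, 6)),
  ActivityType = c("Email", "Email", "Webinar", "Webinar", "Webinar", "Email", "Email",
                   "Email", "Webinar", "Webinar", "Webinar", "Email", "Email"),
  ActivityDate = as.Date(c("1/1/2014", "1/3/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014",
                           "1/1/2014", "1/5/2014", "1/20/2014",
                           "3/25/2014", "4/1/2014", "4/20/2014"), "%m/%d/%Y")
)

# Lower bound of each row's window (hypothetical helper column)
dt[, start := ActivityDate - 21]

# Self-join: for each row of i, match rows of x with the same Name and
# start <= x.ActivityDate <= i.ActivityDate, then count by type
cnt <- dt[dt,
          on = .(Name, ActivityDate >= start, ActivityDate <= ActivityDate),
          .(Email   = sum(x.ActivityType == "Email"),
            Webinar = sum(x.ActivityType == "Webinar")),
          by = .EACHI]
cnt$Email
```

On the sample data, cnt$Email should reproduce the Email_RollingSum column from the question. It is worth timing both versions on a realistic subset before committing to either.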
WaltS answered Nov 02 '22 02:11
