Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

R - count how many rows in a data frame have same value and date is within x days

Tags:

r

I have a dataframe 'DFrame' that looks like this:

RecordNo |  Cust_ID  |  Record_Date
1        |  023      |  2014-03-01
2        |  056      |  2014-01-18
3        |  041      |  2014-03-04
4        |  023      |  2014-03-21
5        |  056      |  2014-01-25
6        |  003      |  2014-03-01
7        |  023      |  2014-04-01
8        |  023      |  2014-04-02

I would like to add a column that shows a count of how many times a record by the same customer ID appears in the following 14 days of the current record_date.

RecordNo |  Cust_ID  |  Record_Date  | 14-day_Repeat_Count
1        |  023      |  2014-03-01   | 0
2        |  056      |  2014-01-18   | 1
3        |  041      |  2014-03-04   | 0
4        |  023      |  2014-03-21   | 2
5        |  056      |  2014-01-25   | 0
6        |  003      |  2014-03-01   | 0
7        |  023      |  2014-04-01   | 1
8        |  023      |  2014-04-02   | 0

I am trying to write fast code in R to accomplish this. I have found a couple of articles that made counting records that meet conditions look easy, but they generally only point to static conditions or conditions not related to the value of the current record: http://one-line-it.blogspot.ca/2013/01/r-number-of-rows-matching-condition.html

I imagine the logic may look like:

# Sort DFRAME by RECORD_DATE decreasing=FALSE
......
# Count records below current record where that have matching Cust_ID
# AND the Record_Date is <= 14 days of the current Record_Date

# put result into DFrame$14-day_Repeat_Count
......

I have done this type of logic in DAX:

=calculate(counta([Cust_ID],
      filter(DFrame,
             [Cust_ID]=Earlier([Cust_ID]) &&
             [Record_Date] > Earlier([Record_Date]) &&
             [Record_Date] <= (Earlier([Record_Date]) + 14)
      )
)

(very fast but proprietary to Microsoft), and in Excel using 'CountIfs' (easy to implement, very very slow, and again requires marriage to Microsoft), can anyone point me to some reading on how one might count based on criteria in R?

like image 881
timonippon Avatar asked Dec 20 '22 15:12

timonippon


1 Answers

Perhaps a faster, more memory efficient answer might look something like this:

##Combine into one data.table
library("data.table")
RecordNo <- 1:36
Record_Date <- c(31,33,38,41,44,59,68,69,75,78,85,88,
                 32,34,45,46,51,54,60,65,67,70,74,80,
                 33,35,42,45,50,60,65,70,75,80,82,85)
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
data <- data.table(Cust_ID,Record_Date)[order(Cust_ID,Record_Date)]

##Assign each customer an number that ranks them
data[,Cust_No:=.GRP,by=c("Cust_ID")]

##Create "list" of comparison dates for each customer
Ref <- data[,list(Compare_Date=list(I(Record_Date))), by=c("Cust_ID")]

##Compare two lists and see of the compare date is within N days
system.time(
  data$Roll.Cnt <- mapply(FUN = function(RD, NUM) {
    d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
    sum((d > 0 & d <= 14))
  }, RD = data$Record_Date,NUM=data$Cust_No)
)

The resulting data looks like the following:

data <- data[,list(Cust_ID,Record_Date,Roll.Cnt)][order(Cust_ID,Record_Date)]
data

     Cust_ID   Record_Date  Roll.Cnt
 1:       1          31        4
 2:       1          33        3
 3:       1          38        2
 4:       1          41        1
 5:       1          44        0
 6:       1          59        2
 7:       1          68        3
 8:       1          69        2
 9:       1          75        3
10:       1          78        2
11:       1          85        1
12:       1          88        0
13:       2          32        3
14:       2          34        2
15:       2          45        3
like image 161
Mike.Gahan Avatar answered Apr 30 '23 21:04

Mike.Gahan