I have a dataframe 'DFrame' that looks like this:
RecordNo | Cust_ID | Record_Date
1 | 023 | 2014-03-01
2 | 056 | 2014-01-18
3 | 041 | 2014-03-04
4 | 023 | 2014-03-21
5 | 056 | 2014-01-25
6 | 003 | 2014-03-01
7 | 023 | 2014-04-01
8 | 023 | 2014-04-02
I would like to add a column that shows a count of how many times a record by the same customer ID appears in the following 14 days of the current record_date.
RecordNo | Cust_ID | Record_Date | 14-day_Repeat_Count
1 | 023 | 2014-03-01 | 0
2 | 056 | 2014-01-18 | 1
3 | 041 | 2014-03-04 | 0
4 | 023 | 2014-03-21 | 2
5 | 056 | 2014-01-25 | 0
6 | 003 | 2014-03-01 | 0
7 | 023 | 2014-04-01 | 1
8 | 023 | 2014-04-02 | 0
I am trying to write fast code in R to accomplish this. I have found a couple of articles that made counting records that meet conditions look easy, but they generally only point to static conditions or conditions not related to the value of the current record: http://one-line-it.blogspot.ca/2013/01/r-number-of-rows-matching-condition.html
I imagine the logic may look like:
# Sort DFRAME by RECORD_DATE decreasing=FALSE
......
# Count records below current record where that have matching Cust_ID
# AND the Record_Date is <= 14 days of the current Record_Date
# put result into DFrame$14-day_Repeat_Count
......
I have done this type of logic in DAX:
=calculate(counta([Cust_ID],
filter(DFrame,
[Cust_ID]=Earlier([Cust_ID]) &&
[Record_Date] > Earlier([Record_Date]) &&
[Record_Date] <= (Earlier([Record_Date]) + 14)
)
)
(very fast but proprietary to Microsoft), and in Excel using 'CountIfs' (easy to implement, very very slow, and again requires marriage to Microsoft), can anyone point me to some reading on how one might count based on criteria in R?
Perhaps a faster, more memory efficient answer might look something like this:
##Combine into one data.table
library("data.table")
RecordNo <- 1:36
Record_Date <- c(31,33,38,41,44,59,68,69,75,78,85,88,
32,34,45,46,51,54,60,65,67,70,74,80,
33,35,42,45,50,60,65,70,75,80,82,85)
Cust_ID <- c(rep(1,12),rep(2,12),rep(3,12))
data <- data.table(Cust_ID,Record_Date)[order(Cust_ID,Record_Date)]
##Assign each customer an number that ranks them
data[,Cust_No:=.GRP,by=c("Cust_ID")]
##Create "list" of comparison dates for each customer
Ref <- data[,list(Compare_Date=list(I(Record_Date))), by=c("Cust_ID")]
##Compare two lists and see of the compare date is within N days
system.time(
data$Roll.Cnt <- mapply(FUN = function(RD, NUM) {
d <- as.numeric(Ref$Compare_Date[[NUM]] - RD)
sum((d > 0 & d <= 14))
}, RD = data$Record_Date,NUM=data$Cust_No)
)
The resulting data looks like the following:
data <- data[,list(Cust_ID,Record_Date,Roll.Cnt)][order(Cust_ID,Record_Date)]
data
Cust_ID Record_Date Roll.Cnt
1: 1 31 4
2: 1 33 3
3: 1 38 2
4: 1 41 1
5: 1 44 0
6: 1 59 2
7: 1 68 3
8: 1 69 2
9: 1 75 3
10: 1 78 2
11: 1 85 1
12: 1 88 0
13: 2 32 3
14: 2 34 2
15: 2 45 3
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With