I have a data set customerId, transactionDate, productId, purchaseQty loaded into a data.table. for each row, I want to calculate the sum, and mean of purchaseQty for the prior 45 day
productId customerID transactionDate purchaseQty
1: 870826 1186951 2016-03-28 162000
2: 870826 1244216 2016-03-31 5000
3: 870826 1244216 2016-04-08 6500
4: 870826 1308671 2016-03-28 221367
5: 870826 1308671 2016-03-29 83633
6: 870826 1308671 2016-11-29 60500
I'm looking for an output like this:
productId customerID transactionDate purchaseQty sumWindowPurchases
1: 870826 1186951 2016-03-28 162000 162000
2: 870826 1244216 2016-03-31 5000 5000
3: 870826 1244216 2016-04-08 6500 11500
4: 870826 1308671 2016-03-28 221367 221367
5: 870826 1308671 2016-03-29 83633 305000
6: 870826 1308671 2016-11-29 60500 60500
so, sumWindowPurchases contains the sum of purchaseQty for the customer/product over a 45 day window from the current transaction date. Once i have that working, throwing the mean, and other calcs I need should be trivial
I went back to my SQL roots and thought of a self join:
select DT.customerId, DT.transactionDate, DT.productId, sum(DT1.purchaseQty)
from DT
inner join DT as DT1 on
DT.customerId = DT1.customerId
and DT.productId = DT1.productId
and DT1.transactionDate between DT.transactionDate and dateadd(day, -45, DT.transactionDate)
Trying to translate that into R using data.dable syntax, I was hoping to do something like this:
DT1 <- DT #alias. have confirmed this is just a pointer
DT[DT1[DT1$transactionDate >= DT$transactionDate - 45],
.(sum(DT1$purchaseQty)),
by = .(DT$customerId , DT$transactionDate ),
on = .(customerId , DT1$transactionDate <= DT$TransactionDate),
allow.cartesian = TRUE]
I guess I have a 2 part question. What is the "R way" to do this. Is a data.table self join the correct approach, or woudl i be better of trying to use the Reduce function?
I suspect the self join is the only way to get the rolling 45 day window in there. so part 2 is I need some help with the data.table syntax to explicitly reference which source table the column comes from, since its a self join and they have the same column names.
Ive been studying the answers that Frank linked to and have come up with this expression
DT[.(p = productId, c = customerID, t = transactionDate, start = transactionDate - 45),
on = .(productId==p, customerID==c, transactionDate<=t, transactionDate>=start),
allow.cartesian = TRUE, nomatch = 0]
which produces this output:
productId customerID transactionDate purchaseQty transactionDate.1
1: 870826 1186951 2016-03-28 162000 2016-02-12
2: 870826 1244216 2016-03-31 5000 2016-02-15
3: 870826 1244216 2016-04-08 5000 2016-02-23
4: 870826 1244216 2016-04-08 6500 2016-02-23
5: 870826 1308671 2016-03-28 221367 2016-02-12
6: 870826 1308671 2016-03-29 221367 2016-02-13
7: 870826 1308671 2016-03-29 83633 2016-02-13
8: 870826 1308671 2016-11-29 60500 2016-10-15
This is very close, to what i need to get to my final step. if i could sum the purchase quantities of this output, group by customer/product/transactionDate.1, i would have something useful. however, I cant get the syntax down for that, not do I understand where the transactionDate.1 name is coming from
First, we find how many transaction dates occur in 45 day window prior to the current date (including current date)
setDT(df)
df[, n:= 1:.N - findInterval(transactionDate - 45, transactionDate), by=.(customerID)]
df
# productId customerID transactionDate purchaseQty n
#1: 870826 1186951 2016-03-28 162000 1
#2: 870826 1244216 2016-03-31 5000 1
#3: 870826 1244216 2016-04-08 6500 2
#4: 870826 1308671 2016-03-28 221367 1
#5: 870826 1308671 2016-03-29 83633 2
#6: 870826 1308671 2016-11-29 60500 1
Next we find a rolling sum of purchaseQty
with window size n
. Adopting a great answer here
g <- function(x, window){
b_pos <- seq_along(x) - window + 1 # begin positions
cum <- cumsum(x)
cum - cum[b_pos] + x[b_pos]
}
df[, sumWindowPurchases := g(purchaseQty, n),][,n:=NULL,]
df
# productId customerID transactionDate purchaseQty sumWindowPurchases
#1: 870826 1186951 2016-03-28 162000 162000
#2: 870826 1244216 2016-03-31 5000 5000
#3: 870826 1244216 2016-04-08 6500 11500
#4: 870826 1308671 2016-03-28 221367 221367
#5: 870826 1308671 2016-03-29 83633 305000
#6: 870826 1308671 2016-11-29 60500 60500
structure(list(productId = c(870826L, 870826L, 870826L, 870826L,
870826L, 870826L), customerID = c(1186951L, 1244216L, 1244216L,
1308671L, 1308671L, 1308671L), transactionDate = structure(c(16888,
16891, 16899, 16888, 16889, 17134), class = "Date"), purchaseQty = c(162000L,
5000L, 6500L, 221367L, 83633L, 60500L)), .Names = c("productId",
"customerID", "transactionDate", "purchaseQty"), row.names = c("1:",
"2:", "3:", "4:", "5:", "6:"), class = "data.frame")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With