I am trying to calculate a cumulative sum over a given window based on a condition. I have seen threads where the solution does a conditional cumulative sum (Calculate a conditional running sum in R for every row in data frame) and a rolling sum (Rolling Sum by Another Variable in R), but I couldn't find the two together. I also saw at R data.table sliding window that data.table doesn't have a rolling window function. So, this problem is very challenging for me.
Moreover, the solution posted by Mike Grahan on rolling sum is beyond my comprehension. I am looking for a data.table based method, primarily for speed. However, I am open to other methods if they are understandable.
Here's my input data:
DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011,
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010),
Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575,
13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578,
13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B",
"B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4,
3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY",
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")
Here's my expected output: (Manually created; My apologies if there is a manual error)
DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012,
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575,
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575,
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A",
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D",
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2),
cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4,
2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")
Some commentary about the logic:
1) I want to find rolling sum in a 5-year period. Ideally, I would like this 5-year period to be variable i.e. something I can specify elsewhere in the code. This way, I have the liberty to vary the window later on for my analysis.
2) The end of the window is based on the maximum year (i.e. FY in the example above). In the above example, the max FY in DFI is 2016. So, the starting year of the window would be 2016 - 5 + 1 = 2012 for all entries in 2016.
3) The window sum (or running sum) is calculated by Customer and for a specific Product.
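To make the window arithmetic in point 2 concrete, here is a minimal base R sketch. The names k, max_fy, fy and rev are only illustrative, and the values are the rows for Customer 13575, Product A from the input data.

```r
k <- 5                           # window length in years (illustrative name)
max_fy <- 2016                   # last year of the window
window_start <- max_fy - k + 1   # 2016 - 5 + 1 = 2012

# Rows for Customer 13575, Product A, taken from DFI
fy  <- c(2011, 2012, 2013, 2015, 2016)
rev <- c(4, 3, 3, 1, 2)

# Keep only the years that fall inside [window_start, max_fy] and add them up
in_window <- fy >= window_start & fy <= max_fy
window_sum <- sum(rev[in_window])   # 3 + 3 + 1 + 2 = 9
```

Changing k changes the window start, which is the kind of flexibility described in point 1.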
What I tried:
I wanted to try something before posting. Here's my code:
DFI <- data.table::as.data.table(DFI)
#Sort it first
DFI<-DFI[order(Customer,FY),]
#find cumulative sum; remove Rev column; order rows
DFOTest<-DFI[,cumsum := cumsum(Rev),by=.(Customer,Product)][,.SD[which.max(cumsum)],by=.(FY,Customer,Product)][,("Rev"):=NULL][order(Customer,Product,FY)]
This code calculates the cumulative sum, but I am unable to define 5-year window and then calculate running sum. I have two questions:
Question 1) How do I calculate a 5-year running sum?
Question 2) Can someone please explain Mike's method on this thread? It seems to be fast. However, I am not really sure what's going on there. I did see that someone requested some commentary, but I am not sure whether it is self-explanatory.
Thanks in advance. I have been struggling on this problem for two days.
My solution stays on the tidyverse side of things; however, if your source data is not excessive, the performance difference may not be an issue.
I will start by declaring a function to calculate the rolling sum using tibbletime::rollify and expand the data frame to include missing FY values. Then group and summarise while applying the rolling sum.
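library(tidyr)
library(dplyr)
# rolling sum over a fixed window of 5 rows
rollsum_5 <- tibbletime::rollify(sum, window = 5)
DFI %>%
  # add a row for every FY/Customer/Product combination; missing Rev becomes 0
  complete(FY, Customer, Product) %>%
  replace_na(list(Rev = 0)) %>%
  arrange(Customer, Product, FY) %>%
  group_by(Customer, Product, FY) %>%
  # collapse duplicate FY rows; the result stays grouped by Customer and Product
  summarise(Rev = sum(Rev)) %>%
  mutate(cumsum = rollsum_5(Rev)) %>%
  ungroup() %>%
  filter(Rev != 0)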
# # A tibble: 16 x 5
# Customer Product FY Rev cumsum
# <dbl> <chr> <dbl> <dbl> <dbl>
# 1 13575 A 2011 4.00 NA
# 2 13575 A 2012 3.00 NA
# 3 13575 A 2013 3.00 NA
# 4 13575 A 2015 1.00 11.0
# 5 13575 A 2016 2.00 9.00
# 6 13575 B 2011 3.00 NA
# 7 13575 B 2012 3.00 NA
# 8 13575 B 2013 4.00 NA
# 9 13575 B 2014 5.00 15.0
# 10 13575 B 2015 6.00 21.0
# 11 13578 A 2010 3.00 NA
# 12 13578 A 2016 2.00 2.00
# 13 13578 B 2013 2.00 NA
# 14 13578 C 2014 4.00 4.00
# 15 13578 D 2015 2.00 2.00
# 16 13578 E 2010 2.00 NA
N.B. The rolling sum in this case only appears in the rows where the full window (5 rows) is available. It could be misleading to suggest that partial values are equal to a five-year sum.
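If partial sums are wanted anyway, as in the question's expected output, one possible variation is to swap the rollify helper for zoo::rollapplyr with partial = TRUE. This is only a sketch of that swapped-in technique, not part of the approach above: the toy tibble holds just the Customer 13575, Product A rows, and the names k, toy and res are illustrative.

```r
library(dplyr)
library(tidyr)

k <- 5  # window length in years (illustrative name)

# Rows for Customer 13575, Product A from the question's data (2014 is missing)
toy <- tibble(FY  = c(2011, 2012, 2013, 2015, 2016),
              Rev = c(4, 3, 3, 1, 2))

res <- toy %>%
  # fill the gap years with Rev = 0 so that one row corresponds to one year
  complete(FY = full_seq(FY, 1), fill = list(Rev = 0)) %>%
  # partial = TRUE sums whatever rows exist at the start of the series
  mutate(cumsum = zoo::rollapplyr(Rev, k, sum, partial = TRUE)) %>%
  # drop the filler rows again
  filter(Rev != 0)
```

Note that the final filter would also drop any genuine zero-revenue rows, so a flag column for the filler rows may be safer on real data.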
1) rollapply
Create a Sum function which takes FY and Rev as a 2-column matrix (or, if it is not one, converts it to one) and then sums the revenues for those years within k of the last year. Then convert DFI to a data table, sum rows having the same Customer/Product/Year and run rollapplyr with Sum for each Customer/Product group.
library(data.table)
library(zoo)
k <- 5
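Sum <- function(x) {
  # make sure x is a 2-column matrix: column 1 = FY, column 2 = Rev
  x <- matrix(x,, 2)
  FY <- x[, 1]
  Rev <- x[, 2]
  # keep only the rows within k years of the last FY in the window
  ok <- FY >= tail(FY, 1) - k + 1
  sum(Rev[ok])
}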
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]
giving:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
2) data.table only
First sum rows that have the same Customer/Product/FY and then, grouping by Customer/Product, for each FY value, fy, pick out the Rev values whose FY values are between fy-k+1 and fy and sum.
library(data.table)
k <- 5
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
by = c("Customer", "Product")]
giving:
> DT
Customer Product FY Rev cumsum
1: 13575 A 2011 4 4
2: 13575 A 2012 3 7
3: 13575 A 2013 3 10
4: 13575 A 2015 1 11
5: 13575 A 2016 2 9
6: 13575 B 2011 3 3
7: 13575 B 2012 3 6
8: 13575 B 2013 4 10
9: 13575 B 2014 5 15
10: 13575 B 2015 6 21
11: 13578 A 2010 3 3
12: 13578 A 2016 2 2
13: 13578 B 2013 2 2
14: 13578 C 2014 4 4
15: 13578 D 2015 2 2
16: 13578 E 2010 2 2
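The same window condition (FY between fy-k+1 and fy) can also be written as a data.table non-equi self-join, which may scale better than sapply on large groups. This is a sketch of an alternative technique rather than the method above; the helper column start is hypothetical, and the toy table holds only the Customer 13575, Product A rows.

```r
library(data.table)

k <- 5  # window length in years

# Rows for Customer 13575, Product A from the question's data
DT <- data.table(Customer = 13575, Product = "A",
                 FY  = c(2011, 2012, 2013, 2015, 2016),
                 Rev = c(4, 3, 3, 1, 2))

# Lower bound of each row's window (hypothetical helper column)
DT[, start := FY - k + 1]

# Self-join: for every row, match the rows of the same Customer/Product whose
# FY lies in [start, FY], then sum the matched Rev values once per row
DT[, cumsum := DT[DT, on = .(Customer, Product, FY >= start, FY <= FY),
                  sum(x.Rev), by = .EACHI]$V1]

# Remove the helper column
DT[, start := NULL]
```

In the on clause, the left side of each condition refers to the table being searched and the right side to the table supplying the rows, so FY >= start reads as "matched FY is at least this row's start".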