Cumulative sum in a window (or running window sum) based on a condition in R

Tags: r, data.table, dplyr

I am trying to calculate a cumulative sum over a given window, based on a condition. I have seen threads that solve the conditional cumulative sum (Calculate a conditional running sum in R for every row in data frame) and the rolling sum (Rolling Sum by Another Variable in R), but I couldn't find one that combines the two. I also saw in "R data.table sliding window" that data.table doesn't have a rolling window function, so this problem is very challenging for me.

Moreover, the solution posted by Mike Grahan on rolling sum is beyond my comprehension. I am looking for a data.table-based method, primarily for speed. However, I am open to other methods if they are understandable.

Here's my input data:

DFI <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 
2012, 2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), 
    Customer = c(13575, 13575, 13575, 13575, 13575, 13575, 13575, 
    13575, 13575, 13575, 13575, 13578, 13578, 13578, 13578, 13578, 
    13578), Product = c("A", "A", "A", "A", "A", "B", "B", "B", 
    "B", "B", "B", "A", "A", "B", "C", "D", "E"), Rev = c(4, 
    3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)), .Names = c("FY", 
"Customer", "Product", "Rev"), row.names = c(NA, 17L), class = "data.frame")

Here's my expected output: (Manually created; My apologies if there is a manual error)

DFO <- structure(list(FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2012, 
2013, 2014, 2015, 2010, 2016, 2013, 2014, 2015, 2010), Customer = c(13575, 
13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 13575, 
13578, 13578, 13578, 13578, 13578, 13578), Product = c("A", "A", 
"A", "A", "A", "B", "B", "B", "B", "B", "A", "A", "B", "C", "D", 
"E"), Rev = c(4, 3, 3, 1, 2, 3, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2), 
    cumsum = c(4, 7, 10, 11, 9, 3, 6, 10, 15, 21, 3, 2, 2, 4, 
    2, 2)), .Names = c("FY", "Customer", "Product", "Rev", "cumsum"
), row.names = c(NA, 16L), class = "data.frame")

Some commentary about the logic:

1) I want to find the rolling sum over a 5-year period. Ideally, this window length should be a variable, i.e. something I can specify elsewhere in the code, so that I can vary the window later in my analysis.

2) The end of the window is based on the maximum year (i.e. FY in the example above). In the example above, the max FY in DFI is 2016, so the starting year of the window is 2016 - 5 + 1 = 2012 for all entries in 2016.

3) The window sum (or running sum) is calculated per Customer and per Product.
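To make the windowing rule concrete, here is a minimal base R sketch for a single Customer/Product group (Customer 13575, Product A from DFI). It assumes, as the expected output DFO suggests, that each row's window ends at that row's own FY and starts at FY - k + 1. The names k, plain and windowed are only illustrative.

```r
k <- 5                                   # window length in years (illustrative name)
FY  <- c(2011, 2012, 2013, 2015, 2016)   # Customer 13575, Product A
Rev <- c(4, 3, 3, 1, 2)

# Plain running total: old years are never dropped.
plain <- cumsum(Rev)

# Windowed total: for each year fy, sum Rev over the years fy - k + 1 .. fy.
windowed <- vapply(
  FY,
  function(fy) sum(Rev[FY >= fy - k + 1 & FY <= fy]),
  numeric(1)
)

data.frame(FY, Rev, window_start = FY - k + 1, plain, windowed)
```

The two columns agree until 2016, where the window starts at 2012 and the 2011 revenue drops out (9 instead of 13).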

What I tried:

I wanted to try something before posting. Here's my code:

  DFI <- data.table::as.data.table(DFI)

  # Sort it first
  DFI <- DFI[order(Customer, FY), ]

  # Find cumulative sum; remove Rev column; order rows
  DFOTest <- DFI[, cumsum := cumsum(Rev), by = .(Customer, Product)
    ][, .SD[which.max(cumsum)], by = .(FY, Customer, Product)
    ][, ("Rev") := NULL
    ][order(Customer, Product, FY)]

This code calculates the cumulative sum, but I am unable to define a 5-year window and then calculate the running sum within it. I have two questions:

Question 1) How do I calculate a 5-year running sum?

Question 2) Can someone please explain Mike's method on this thread? It seems to be fast, but I am not really sure what's going on there. I did see that someone requested some commentary, but I am not sure whether it is self-explanatory.

Thanks in advance. I have been struggling on this problem for two days.

asked Jan 25 '18 01:01

watchtower

2 Answers

My solution stays on the tidyverse side of things. If your source data is not excessively large, the performance difference may not be an issue.

I start by declaring a function that calculates the rolling sum using tibbletime::rollify, then expand the data frame to include the missing FY values (filling Rev with 0), and finally group and summarise before applying the rolling sum.

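library(tidyr)
library(dplyr)

# Rolling sum over a fixed window of 5 rows (one row per FY after complete())
rollsum_5 <- tibbletime::rollify(sum, window = 5)

DFI %>%
  complete(FY, Customer, Product) %>%    # add rows for missing combinations
  replace_na(list(Rev = 0)) %>%          # missing revenue counts as 0
  arrange(Customer, Product, FY) %>%
  group_by(Customer, Product, FY) %>%
  summarise(Rev = sum(Rev)) %>%          # collapse duplicate Customer/Product/FY rows
  mutate(cumsum = rollsum_5(Rev)) %>%    # still grouped by Customer, Product
  ungroup() %>%
  filter(Rev != 0)                       # drop the padding rows again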

# # A tibble: 16 x 5
#    Customer Product    FY   Rev cumsum
#       <dbl> <chr>   <dbl> <dbl>  <dbl>
#  1    13575 A        2011  4.00  NA   
#  2    13575 A        2012  3.00  NA   
#  3    13575 A        2013  3.00  NA   
#  4    13575 A        2015  1.00  11.0 
#  5    13575 A        2016  2.00   9.00
#  6    13575 B        2011  3.00  NA   
#  7    13575 B        2012  3.00  NA   
#  8    13575 B        2013  4.00  NA   
#  9    13575 B        2014  5.00  15.0 
# 10    13575 B        2015  6.00  21.0 
# 11    13578 A        2010  3.00  NA   
# 12    13578 A        2016  2.00   2.00
# 13    13578 B        2013  2.00  NA   
# 14    13578 C        2014  4.00   4.00
# 15    13578 D        2015  2.00   2.00
# 16    13578 E        2010  2.00  NA

N.B. The rolling sum in this case only appears in rows where the full window (5 rows) is intact. It could be misleading to suggest that partial values are equal to a five-year sum.
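The difference between an intact window and a partial one can be sketched in base R on the zero-filled series for Customer 13575, Product A (years 2010 to 2016). This is only an illustration of the two behaviours, not what tibbletime does internally. The names full and partial are made up for the example.

```r
k <- 5
Rev <- c(0, 4, 3, 3, 0, 1, 2)   # FY 2010..2016, missing years filled with 0

# Full windows only: NA until k rows are available.
full <- vapply(
  seq_along(Rev),
  function(i) if (i < k) NA_real_ else sum(Rev[(i - k + 1):i]),
  numeric(1)
)

# Partial windows: sum whatever rows exist so far, up to k of them.
partial <- vapply(
  seq_along(Rev),
  function(i) sum(Rev[max(1, i - k + 1):i]),
  numeric(1)
)

rbind(full, partial)
```

If partial sums are what you want in a pipeline, a rolling function with a partial option, such as zoo::rollapplyr(..., partial = TRUE), gives the second behaviour.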

answered Sep 21 '22 17:09

Kevin Arseneau

1) rollapply. Create a function Sum that takes FY and Rev as a 2-column matrix (coercing its input to one if it is not) and sums the revenues for the years within k of the last year in the window. Then convert DFI to a data.table, sum the rows that share the same Customer/Product/FY, and run rollapplyr with Sum for each Customer/Product group.

library(data.table)
library(zoo)

k <- 5
Sum <- function(x) {
  x <- matrix(x,, 2)
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1
  sum(Rev[ok])
}
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := rollapplyr(.SD, k, Sum, by.column = FALSE, partial = TRUE),
       by = c("Customer", "Product"), .SDcols = c("FY", "Rev")]

giving:

 > DT
    Customer Product   FY Rev cumsum
 1:    13575       A 2011   4      4
 2:    13575       A 2012   3      7
 3:    13575       A 2013   3     10
 4:    13575       A 2015   1     11
 5:    13575       A 2016   2      9
 6:    13575       B 2011   3      3
 7:    13575       B 2012   3      6
 8:    13575       B 2013   4     10
 9:    13575       B 2014   5     15
10:    13575       B 2015   6     21
11:    13578       A 2010   3      3
12:    13578       A 2016   2      2
13:    13578       B 2013   2      2
14:    13578       C 2014   4      4
15:    13578       D 2015   2      2
16:    13578       E 2010   2      2
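To see what Sum does on one window, it can be called by hand in base R. The sketch below repeats the Sum definition from above and feeds it the rows of Customer 13575, Product A. The matrix w is just an illustrative stand-in for what rollapplyr passes in.

```r
k <- 5
Sum <- function(x) {
  x <- matrix(x, , 2)               # restore the 2-column shape if x arrives as a plain vector
  FY <- x[, 1]
  Rev <- x[, 2]
  ok <- FY >= tail(FY, 1) - k + 1   # keep rows within k years of the window's last year
  sum(Rev[ok])
}

w <- cbind(FY = c(2011, 2012, 2013, 2015, 2016), Rev = c(4, 3, 3, 1, 2))

Sum(w)          # window ending 2016: 2011 is dropped, 3 + 3 + 1 + 2 = 9
Sum(w[1:4, ])   # window ending 2015: 4 + 3 + 3 + 1 = 11
Sum(w[1, ])     # a single row arrives as a vector and is reshaped to 1 x 2: 4
```

So the row-based window of rollapplyr only limits how many rows Sum sees. The year filter inside Sum is what enforces the k-year span when years are missing.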

2) data.table only

First sum the rows that have the same Customer/Product/FY. Then, grouping by Customer/Product, for each FY value fy, pick out the Rev values whose FY lies between fy-k+1 and fy, and sum them.

library(data.table)

k <- 5
DT <- as.data.table(DFI)
DT <- DT[, list(Rev = sum(Rev)), by = c("Customer", "Product", "FY")]
DT[, cumsum := sapply(FY, function(fy) sum(Rev[between(FY, fy-k+1, fy)])),
       by = c("Customer", "Product")]

giving:

> DT
    Customer Product   FY Rev cumsum
 1:    13575       A 2011   4      4
 2:    13575       A 2012   3      7
 3:    13575       A 2013   3     10
 4:    13575       A 2015   1     11
 5:    13575       A 2016   2      9
 6:    13575       B 2011   3      3
 7:    13575       B 2012   3      6
 8:    13575       B 2013   4     10
 9:    13575       B 2014   5     15
10:    13575       B 2015   6     21
11:    13578       A 2010   3      3
12:    13578       A 2016   2      2
13:    13578       B 2013   2      2
14:    13578       C 2014   4      4
15:    13578       D 2015   2      2
16:    13578       E 2010   2      2
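As a variation on the same idea, the window can be expressed as a data.table non-equi join instead of sapply. This is a sketch, not a benchmarked replacement. The helper table windows and its columns lo and hi are names introduced here for illustration.

```r
library(data.table)

k <- 5   # window length in years

DFI <- data.frame(
  FY = c(2011, 2012, 2013, 2015, 2016, 2011, 2011, 2012, 2013, 2014,
         2015, 2010, 2016, 2013, 2014, 2015, 2010),
  Customer = rep(c(13575, 13578), times = c(11, 6)),
  Product = c(rep("A", 5), rep("B", 6), "A", "A", "B", "C", "D", "E"),
  Rev = c(4, 3, 3, 1, 2, 1, 2, 3, 4, 5, 6, 3, 2, 2, 4, 2, 2)
)

DT <- as.data.table(DFI)
DT <- DT[, .(Rev = sum(Rev)), by = .(Customer, Product, FY)]

# One window per row: same Customer/Product, years lo..hi.
windows <- DT[, .(Customer, Product, lo = FY - k + 1, hi = FY)]

# For each window (row of i), sum the Rev of all matching rows of DT.
res <- DT[windows,
          on = .(Customer, Product, FY >= lo, FY <= hi),
          .(cumsum = sum(Rev)),
          by = .EACHI]

# by = .EACHI returns one row per window, in the order of `windows`.
DT[, cumsum := res$cumsum]
DT
```

Because the window length only enters through lo, changing k is enough to try a different period.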

answered Sep 22 '22 17:09

G. Grothendieck
