r cumsum per group in dplyr

Tags:

I am starting to enjoy dplyr but I got stuck on a use case. I want to be able to apply cumsum per group in a dataframe with the package but I can't seem to get it right.

For a demo dataframe I've generated the following data:

set.seed(123)

len = 10 
dates = as.Date('2014-01-01') + 1:len
grp_a = data.frame(dates=dates, group='A', sales=rnorm(len))
grp_b = data.frame(dates=dates, group='B', sales=rnorm(len))
grp_c = data.frame(dates=dates, group='C', sales=rnorm(len))
df = rbind(grp_a, grp_b, grp_c)

This creates a dataframe that looks like:

        dates group       sales
1  2014-01-02     A -0.56047565
2  2014-01-03     A -0.23017749
3  2014-01-04     A  1.55870831
4  2014-01-05     A  0.07050839
5  2014-01-06     A  0.12928774
6  2014-01-02     B  1.71506499
7  2014-01-03     B  0.46091621
8  2014-01-04     B -1.26506123
9  2014-01-05     B -0.68685285
10 2014-01-06     B -0.44566197
11 2014-01-02     C  1.22408180
12 2014-01-03     C  0.35981383
13 2014-01-04     C  0.40077145
14 2014-01-05     C  0.11068272
15 2014-01-06     C -0.55584113

I then go on to create a dataframe for plotting, but with a for loop that I'd like to replace with something cleaner.

pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
for(grp in unique(df$group)){
  subs = filter(df, group == grp) %>% arrange(dates)
  pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=cumsum(subs$sales)))
}

I use this pdf to create a plot.

p = ggplot() 
p = p + geom_line(data=pdf, aes(dates, sales, colour=group))
p + ggtitle("sales per group")

enter image description here

Is there a better way (a way by using the dplyr methods) to create this dataframe? I've looked at the summarize method but this seems to aggregate a group from N items -> 1 item. This use case seems to break my dplyr flow at the moment. Any suggestions to better approach this?

998

asked Dec 03 '14 15:12

cantdutchthis

2 Answers

Ah. After fiddling around I seem to have found it.

pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))

Output with forloop in question:

> pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
> for(grp in unique(df$group)){
+   subs = filter(df, group == grp) %>% arrange(dates)
+   pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
+ }
> pdf
        dates group       sales         cs
1  2014-01-02     A -0.56047565 -0.5604756
2  2014-01-03     A -0.23017749 -0.7906531
3  2014-01-04     A  1.55870831  0.7680552
4  2014-01-05     A  0.07050839  0.8385636
5  2014-01-06     A  0.12928774  0.9678513
6  2014-01-02     B  1.71506499  1.7150650
7  2014-01-03     B  0.46091621  2.1759812
8  2014-01-04     B -1.26506123  0.9109200
9  2014-01-05     B -0.68685285  0.2240671
10 2014-01-06     B -0.44566197 -0.2215949
11 2014-01-02     C  1.22408180  1.2240818
12 2014-01-03     C  0.35981383  1.5838956
13 2014-01-04     C  0.40077145  1.9846671
14 2014-01-05     C  0.11068272  2.0953498
15 2014-01-06     C -0.55584113  1.5395087

Output with this line of code:

> pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
> pdf
Source: local data frame [15 x 4]
Groups: group

        dates group       sales         cs
1  2014-01-02     A -0.56047565 -0.5604756
2  2014-01-03     A -0.23017749 -0.7906531
3  2014-01-04     A  1.55870831  0.7680552
4  2014-01-05     A  0.07050839  0.8385636
5  2014-01-06     A  0.12928774  0.9678513
6  2014-01-02     B  1.71506499  1.7150650
7  2014-01-03     B  0.46091621  2.1759812
8  2014-01-04     B -1.26506123  0.9109200
9  2014-01-05     B -0.68685285  0.2240671
10 2014-01-06     B -0.44566197 -0.2215949
11 2014-01-02     C  1.22408180  1.2240818
12 2014-01-03     C  0.35981383  1.5838956
13 2014-01-04     C  0.40077145  1.9846671
14 2014-01-05     C  0.11068272  2.0953498
15 2014-01-06     C -0.55584113  1.5395087

answered Oct 19 '22 16:10

cantdutchthis

try using

 group_by(group) %>%
 arrange(group) %>% 
 summarise(cs = sum(sales)) %>%
 mutate(sales_grp = cumsum(cs))

I know this Q/A is a bit dated, but this may help anyone stuck where I was stuck after reading the dplyr posted solution to cumsum() at https://dplyr.tidyverse.org/articles/window-functions.html. and http://www.sthda.com/english/articles/17-tips-tricks/57-dplyr-how-to-add-cumulative-sums-by-groups-into-a-data-framee/

The solutions posted at the links above don't sum by group. The code only adds the next row in sequence -- which is not a grouped cumulative sum. I would think the OP was looking for the sum of sales for Group A, Group B, and Group C with each group total added to the next -- your total n() in the OPs case should be 3 not 15 with a grouped cumsum(). This would be helpful if you were trying to calculate what percent of your annual sales occurred by the end of October, for example, such that you want the sum of sales by month then the total in sequence over the course of the year.

So, you need to first summarize the value by each group then sum them up in order of the groups. PS -- this is my first attempt at trying to answer a question on SO and posting a solution that worked for me; so I hope you will please be kind if my answer is wrong. I always try to learn from my mistakes.

answered Oct 19 '22 15:10

PigeonToast

Related questions
                            
                                Non character argument in R string split function (strsplit)
                            
                                R - dplyr Summarize and Retain Other Columns
                            
                                Are there any official naming conventions for R?
                            
                                dplyr mutate in R - add column as concat of columns
                            
                                Understanding .I in data.table in R
                            
                                ggplot geom_point() with colors based on specific, discrete values
                            
                                Why is this naive matrix multiplication faster than base R's?
                            
                                rm(list=ls()) doesn't completely clear the workspace
                            
                                ggplot2 : Adding two errorbars to each point in scatterplot
                            
                                R equivalent of python "_"?
                            
                                display a matrix, including the values, as a heatmap
                            
                                Using dynamic column names in `data.table`
                            
                                How to tell what packages you have used in R
                            
                                large amount of data in many text files - how to process?
                            
                                Write list of data.frames to separate CSV files with lapply
                            
                                Using the %>% pipe, and dot (.) notation
                            
                                How do I find the edges of a vertex using igraph and R?
                            
                                Position legend in first plot of facet
                            
                                Make conditionalPanel depend on files uploaded with fileInput
                            
                                Use different center than the prime meridian in plotting a world map

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

r cumsum per group in dplyr

Tags:

r

dplyr

ggplot2

cantdutchthis

People also ask

2 Answers

Output with forloop in question:

Output with this line of code:

cantdutchthis

PigeonToast

Recent Activity

Donate For Us