I am starting to enjoy dplyr
but I got stuck on a use case. I want to be able to apply cumsum
per group in a dataframe with the package but I can't seem to get it right.
For a demo dataframe I've generated the following data:
set.seed(123)
len = 10
dates = as.Date('2014-01-01') + 1:len
grp_a = data.frame(dates=dates, group='A', sales=rnorm(len))
grp_b = data.frame(dates=dates, group='B', sales=rnorm(len))
grp_c = data.frame(dates=dates, group='C', sales=rnorm(len))
df = rbind(grp_a, grp_b, grp_c)
This creates a dataframe that looks like:
dates group sales
1 2014-01-02 A -0.56047565
2 2014-01-03 A -0.23017749
3 2014-01-04 A 1.55870831
4 2014-01-05 A 0.07050839
5 2014-01-06 A 0.12928774
6 2014-01-02 B 1.71506499
7 2014-01-03 B 0.46091621
8 2014-01-04 B -1.26506123
9 2014-01-05 B -0.68685285
10 2014-01-06 B -0.44566197
11 2014-01-02 C 1.22408180
12 2014-01-03 C 0.35981383
13 2014-01-04 C 0.40077145
14 2014-01-05 C 0.11068272
15 2014-01-06 C -0.55584113
I then go on to create a dataframe for plotting, but with a for loop that I'd like to replace with something cleaner.
pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
for(grp in unique(df$group)){
subs = filter(df, group == grp) %>% arrange(dates)
pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=cumsum(subs$sales)))
}
I use this pdf
to create a plot.
p = ggplot()
p = p + geom_line(data=pdf, aes(dates, sales, colour=group))
p + ggtitle("sales per group")
Is there a better way (a way by using the dplyr methods) to create this dataframe? I've looked at the summarize
method but this seems to aggregate a group from N items -> 1 item. This use case seems to break my dplyr flow at the moment. Any suggestions to better approach this?
Introduction to Cumsum MATLAB. Cumsum stands for the cumulative sum of elements. It gives the overall addition of elements trough an array or set of elements. It is used for a range of numbers or sets of random numbers.
Ah. After fiddling around I seem to have found it.
pdf = df %>% group_by(group) %>% arrange(dates) %>% mutate(cs = cumsum(sales))
> pdf = data.frame(dates=as.Date(as.character()), group=as.character(), sales=as.numeric())
> for(grp in unique(df$group)){
+ subs = filter(df, group == grp) %>% arrange(dates)
+ pdf = rbind(pdf, data.frame(dates=subs$dates, group=grp, sales=subs$sales, cs=cumsum(subs$sales)))
+ }
> pdf
dates group sales cs
1 2014-01-02 A -0.56047565 -0.5604756
2 2014-01-03 A -0.23017749 -0.7906531
3 2014-01-04 A 1.55870831 0.7680552
4 2014-01-05 A 0.07050839 0.8385636
5 2014-01-06 A 0.12928774 0.9678513
6 2014-01-02 B 1.71506499 1.7150650
7 2014-01-03 B 0.46091621 2.1759812
8 2014-01-04 B -1.26506123 0.9109200
9 2014-01-05 B -0.68685285 0.2240671
10 2014-01-06 B -0.44566197 -0.2215949
11 2014-01-02 C 1.22408180 1.2240818
12 2014-01-03 C 0.35981383 1.5838956
13 2014-01-04 C 0.40077145 1.9846671
14 2014-01-05 C 0.11068272 2.0953498
15 2014-01-06 C -0.55584113 1.5395087
> pdf = df %>% group_by(group) %>% mutate(cs = cumsum(sales))
> pdf
Source: local data frame [15 x 4]
Groups: group
dates group sales cs
1 2014-01-02 A -0.56047565 -0.5604756
2 2014-01-03 A -0.23017749 -0.7906531
3 2014-01-04 A 1.55870831 0.7680552
4 2014-01-05 A 0.07050839 0.8385636
5 2014-01-06 A 0.12928774 0.9678513
6 2014-01-02 B 1.71506499 1.7150650
7 2014-01-03 B 0.46091621 2.1759812
8 2014-01-04 B -1.26506123 0.9109200
9 2014-01-05 B -0.68685285 0.2240671
10 2014-01-06 B -0.44566197 -0.2215949
11 2014-01-02 C 1.22408180 1.2240818
12 2014-01-03 C 0.35981383 1.5838956
13 2014-01-04 C 0.40077145 1.9846671
14 2014-01-05 C 0.11068272 2.0953498
15 2014-01-06 C -0.55584113 1.5395087
try using
group_by(group) %>%
arrange(group) %>%
summarise(cs = sum(sales)) %>%
mutate(sales_grp = cumsum(cs))
I know this Q/A is a bit dated, but this may help anyone stuck where I was stuck after reading the dplyr posted solution to cumsum() at https://dplyr.tidyverse.org/articles/window-functions.html. and http://www.sthda.com/english/articles/17-tips-tricks/57-dplyr-how-to-add-cumulative-sums-by-groups-into-a-data-framee/
The solutions posted at the links above don't sum by group. The code only adds the next row in sequence -- which is not a grouped cumulative sum. I would think the OP was looking for the sum of sales for Group A, Group B, and Group C with each group total added to the next -- your total n() in the OPs case should be 3 not 15 with a grouped cumsum(). This would be helpful if you were trying to calculate what percent of your annual sales occurred by the end of October, for example, such that you want the sum of sales by month then the total in sequence over the course of the year.
So, you need to first summarize the value by each group then sum them up in order of the groups. PS -- this is my first attempt at trying to answer a question on SO and posting a solution that worked for me; so I hope you will please be kind if my answer is wrong. I always try to learn from my mistakes.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With