
How to get the cumulative sum by group in R?

Suppose I have a dataframe such that:

df <- data.frame(id = 1:8, group = c(1, 0, 0, 1, 1, 0, 1, 0), rep = c(rep("d1", 4), rep("d2", 4)), value = rbinom(8, 1, 0.6))
df
  id group rep value
1  1     1  d1     0
2  2     0  d1     0
3  3     0  d1     0
4  4     1  d1     1
5  5     1  d2     1
6  6     0  d2     0
7  7     1  d2     1
8  8     0  d2     1

What's the best way to get the cumulative sum by group and rep such that:

cumsum
group d1  d1+d2  d1+d2+d3
0     0     1      ...
1     1     3      ...
David Z asked Apr 11 '14 17:04

1 Answer

I'd recommend working with the tidy form of the data. Here's an approach with dplyr, but it would be trivial to translate to data.table or base R.

First I'll create the dataset, setting the random seed to make the example reproducible:

set.seed(1014)
df <- data.frame(
  id = 1:8,
  group = c(1, 0, 0, 1, 1, 0, 1, 0),
  rep = c(rep("d1", 4), rep("d2", 4)),
  value = rbinom(8, 1, 0.6)
)
df

#>   id group rep value
#> 1  1     1  d1     1
#> 2  2     0  d1     0
#> 3  3     0  d1     0
#> 4  4     1  d1     1
#> 5  5     1  d2     1
#> 6  6     0  d2     1
#> 7  7     1  d2     1
#> 8  8     0  d2     1

Next, using dplyr, I'll first collapse to one row per group and rep combination, and then compute the cumulative sum within each group:

library(dplyr)

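df <- df %>% 
  group_by(group, rep) %>%
  # one row per group/rep pair; summarise() drops the last grouping level (rep)
  summarise(value = sum(value)) %>%
  # the result is still grouped by group, so cumsum() restarts for each group
  mutate(csum = cumsum(value))
df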

#> Source: local data frame [4 x 4]
#> Groups: group
#> 
#>   group rep value csum
#> 1     0  d1     0    0
#> 2     0  d2     2    2
#> 3     1  d1     2    2
#> 4     1  d2     2    4

In most cases, you're best off leaving the data in this form (it will be easier to work with), but you can reshape it if you need to:

library(reshape2)

dcast(df, group ~ rep, value.var = "csum")

#>   group d1 d2
#> 1     0  0  2
#> 2     1  2  4
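
If you'd rather not load any packages, here is a rough base R sketch of the same two steps (sum within each group/rep pair, then a running total within each group). It is one possible translation, not the only one. The names totals and wide are made up for this illustration, and the value column is typed in by hand to match the data printed above, so nothing depends on the random draw.

```r
# Same data as printed above, with value entered by hand (assumption:
# these are the eight values shown in the printed data frame)
df <- data.frame(
  id = 1:8,
  group = c(1, 0, 0, 1, 1, 0, 1, 0),
  rep = c(rep("d1", 4), rep("d2", 4)),
  value = c(1, 0, 0, 1, 1, 1, 1, 1)
)

# Step 1: sum value within each group/rep combination
totals <- aggregate(value ~ group + rep, data = df, FUN = sum)

# Sort so that the reps run d1, d2, ... inside each group
totals <- totals[order(totals$group, totals$rep), ]

# Step 2: running total over reps, restarted for each group
totals$csum <- ave(totals$value, totals$group, FUN = cumsum)
totals

# Optional wide layout: one row per group, one column per rep
wide <- xtabs(csum ~ group + rep, data = totals)
wide
```

Here ave() plays the role of the grouped mutate(), and xtabs() stands in for dcast(); xtabs() returns a table rather than a data frame, which is fine for display but may need converting if you want to keep working with it.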
hadley answered Oct 02 '22 22:10