I would like to do a cumulative sum on a field but reset the aggregated value whenever a 0 is encountered.
Here is an example of what I want:
df <- data.frame(campaign = letters[1:4],
                 date = c("jan", "feb", "march", "april"),
                 b = c(1, 0, 1, 1),
                 whatiwant = c(1, 0, 1, 2))
df
  campaign  date b whatiwant
1        a   jan 1         1
2        b   feb 0         0
3        c march 1         1
4        d april 1         2
A real-world example of a cumulative sum is the increasing amount of water in a swimming pool. Example: Input: 10, 15, 20, 25, 30. Output: 10, 25, 45, 70, 100.
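In R, the pool figures above can be reproduced with the base function cumsum(). This is only a minimal sketch; the names added and total are illustrative and not from the original:

```r
# Amount of water added at each step (values from the example above)
added <- c(10, 15, 20, 25, 30)

# Running total after each step: 10 25 45 70 100
total <- cumsum(added)
print(total)
```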
Cumulative sums, or running totals, are used to display the total sum of data as it grows with time (or any other series or progression). This lets you view the total contribution so far of a given measure against time.
In MATLAB, if A is a vector, then cumsum(A) returns a vector containing the cumulative sum of the elements of A. If A is a matrix, then cumsum(A) returns a matrix containing the cumulative sums for each column of A. If A is a multidimensional array, then cumsum(A) acts along the first nonsingleton dimension.
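The matrix behaviour described above is how MATLAB's cumsum works. Base R's cumsum() instead treats a matrix as one long vector, so a per-column running total needs something like apply(). A small sketch with an illustrative matrix m (all names here are assumptions):

```r
m <- matrix(1:6, nrow = 3)      # columns are 1:3 and 4:6

flat <- cumsum(m)               # runs over all six values: 1 3 6 10 15 21
per_col <- apply(m, 2, cumsum)  # restarts in each column: (1 3 6) and (4 9 15)
```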
To create a cumulative sum plot in base R, use the plot function and apply cumsum to the variable whose running total should be plotted.
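A minimal base R sketch of such a plot, using the pool figures from earlier as illustrative data (the variable names and axis labels are assumptions):

```r
added <- c(10, 15, 20, 25, 30)
running_total <- cumsum(added)

# Step plot of the running total against the observation index
plot(seq_along(running_total), running_total, type = "s",
     xlab = "Step", ylab = "Cumulative amount")
```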
Another late idea:
ff = function(x)
{
    cs = cumsum(x)
    # subtract the running total that had been reached at the most recent 0
    cs - cummax((x == 0) * cs)
}
ff(c(0, 1, 3, 0, 0, 5, 2))
#[1] 0 1 4 0 0 5 7
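To see why this works, the intermediate vectors can be written out for the same input. This is only an illustration of the function above, with made-up variable names. It assumes non-negative values, since cummax() relies on the running total never decreasing:

```r
x <- c(0, 1, 3, 0, 0, 5, 2)

cs      <- cumsum(x)         # 0 1 4 4 4 9 11 : plain running total
at_zero <- (x == 0) * cs     # 0 0 0 4 4 0 0  : running total, kept only where x is 0
offset  <- cummax(at_zero)   # 0 0 0 4 4 4 4  : total reached at the most recent 0
result  <- cs - offset       # 0 1 4 0 0 5 7  : running total since the most recent 0
```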
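And to compare with the data.table approach (x here is presumably the benchmark vector from the benchmarks further down, x <- sample(0:1e3, 1e7, replace = TRUE)):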
library(data.table)
ffdt = function(x)
    data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)]$whatiwant
x = as.numeric(x) ##because 'cumsum' causes integer overflow
identical(ff(x), ffdt(x))
#[1] TRUE
microbenchmark::microbenchmark(ff(x), ffdt(x), times = 25)
#Unit: milliseconds
#    expr      min       lq   median       uq      max neval
#   ff(x) 315.8010 362.1089 372.1273 386.3892 405.5218    25
# ffdt(x) 374.6315 407.2754 417.6675 447.8305 534.8153    25
Another base R option would be just
with(df, ave(b, cumsum(b == 0), FUN = cumsum))
## [1] 1 0 1 2
This just divides column b into groups according to the appearances of 0 and computes the cumulative sum of b within each group.
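The grouping vector is the key step, so it can help to look at it on its own. A small sketch using the b column from the question (grp and res are illustrative names):

```r
b <- c(1, 0, 1, 1)

grp <- cumsum(b == 0)             # 0 1 1 1 : a new group starts at every 0
res <- ave(b, grp, FUN = cumsum)  # 1 0 1 2 : cumulative sum restarted per group
```

Because each group begins with the 0 itself, the running total inside the group starts again from 0.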
Another solution using a recent data.table version (v 1.9.6+)
library(data.table) ## v 1.9.6+
setDT(df)[, whatiwant := cumsum(b), by = rleid(b == 0L)]
#    campaign  date b whatiwant
# 1:        a   jan 1         1
# 2:        b   feb 0         0
# 3:        c march 1         1
# 4:        d april 1         2
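rleid() numbers consecutive runs of equal values, so every stretch of zeros and every stretch of non-zeros gets its own group. For readers without data.table, roughly the same ids can be built in base R with rle(). The helper run_id below is hypothetical, not a data.table function, and is only a sketch for a single vector:

```r
# Hypothetical base R stand-in for rleid() on one vector
run_id <- function(v) {
  r <- rle(v)
  rep(seq_along(r$lengths), times = r$lengths)
}

b <- c(1, 0, 1, 1)
ids <- run_id(b == 0)             # 1 2 3 3 : one id per run of TRUE/FALSE
res <- ave(b, ids, FUN = cumsum)  # 1 0 1 2
```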
Some benchmarks, as requested in the comments
set.seed(123)
x <- sample(0:1e3, 1e7, replace = TRUE)
system.time(res1 <- ave(x, cumsum(x == 0), FUN = cumsum))
#    user  system elapsed
#    1.54    0.24    1.81
system.time(res2 <- Reduce(function(x, y) if (y == 0) 0 else x+y, x, accumulate=TRUE))
#    user  system elapsed
#   33.94    0.39   34.85
library(data.table)
system.time(res3 <- data.table(x)[, whatiwant := cumsum(x), by = rleid(x == 0L)])
#    user  system elapsed
#    0.20    0.00    0.21
identical(res1, as.integer(res2))
## [1] TRUE
identical(res1, res3$whatiwant)
## [1] TRUE
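The Reduce() variant is by far the slowest in this benchmark, but it is the most literal statement of the rule: carry the running total forward and drop it to 0 whenever a 0 appears. A small sketch on a short vector (the argument names acc and y are illustrative):

```r
x <- c(0, 1, 3, 0, 0, 5, 2)

# acc is the running total so far, y is the next value
res <- Reduce(function(acc, y) if (y == 0) 0 else acc + y,
              x, accumulate = TRUE)
# res is 0 1 4 0 0 5 7
```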