I have a data frame, the data frame is already sorted as needed, but now I will like to "slice it" in groups.
This groups should have a max cumulative value of 10. When the cumulative value is > 10, it should reset the cumulative sum and start over again
library(dplyr)
id <- sample(1:15)
order <- 1:15
value <- c(4, 5, 7, 3, 8, 1, 2, 5, 3, 6, 2, 6, 3, 1, 4)
df <- data.frame(id, order, value)
df
This is the output I'm looking for(I did it "manually")
cumsum_10 <- c(4, 9, 7, 10, 8, 9, 2, 7, 10, 6, 8, 6, 9, 10, 4)
group_10 <- c(1, 1, 2, 2, 3, 3, 4, 4, 4, 5, 5, 6, 6, 6, 7)
df1 <- data.frame(df, cumsum_10, group_10)
df1
So I'm having 2 problems
For the first part I was trying some combinations of group_by and cumsum with no luck
df1 <- df %>% group_by(cumsum(c(False, value < 10)))
I would prefer a pipe (%>%) solution instead of a for loop
Thanks
I think this is not easily vectorizable.... at least i do not know how.
You can do it by hand
via:
my_cumsum <- function(x){
grp = integer(length(x))
grp[1] = 1
for(i in 2:length(x)){
if(x[i-1] + x[i] <= 10){
grp[i] = grp[i-1]
x[i] = x[i-1] + x[i]
} else {
grp[i] = grp[i-1] + 1
}
}
data.frame(grp, x)
}
For your data this gives:
> my_cumsum(df$value)
grp x
1 1 4
2 1 9
3 2 7
4 2 10
5 3 8
6 3 9
7 4 2
8 4 7
9 4 10
10 5 6
11 5 8
12 6 6
13 6 9
14 6 10
15 7 4
Also for my "counter-example" this gives:
> my_cumsum(c(10,6,4))
grp x
1 1 10
2 2 6
3 2 10
As @Khashaa pointed out this can be implementet more efficiently via Rcpp
. He linked to this answer How to speed up or vectorize a for loop? which i find very useful
You could define your own function and then use it inside dplyr's mutate
statement as follows:
df %>% group_by() %>%
mutate(
cumsum_10 = cumsum_with_reset(value, 10),
group_10 = cumsum_with_reset_group(value, 10)
) %>%
ungroup()
The cumsum_with_reset()
function takes a column and a threshold value which resets the sum. cumsum_with_reset_group()
is similar but identifies rows that have been grouped together. Definitions are as follows:
# group rows based on cumsum with reset
cumsum_with_reset_group <- function(x, threshold) {
cumsum <- 0
group <- 1
result <- numeric()
for (i in 1:length(x)) {
cumsum <- cumsum + x[i]
if (cumsum > threshold) {
group <- group + 1
cumsum <- x[i]
}
result = c(result, group)
}
return (result)
}
# cumsum with reset
cumsum_with_reset <- function(x, threshold) {
cumsum <- 0
group <- 1
result <- numeric()
for (i in 1:length(x)) {
cumsum <- cumsum + x[i]
if (cumsum > threshold) {
group <- group + 1
cumsum <- x[i]
}
result = c(result, cumsum)
}
return (result)
}
# use functions above as window functions inside mutate statement
df %>% group_by() %>%
mutate(
cumsum_10 = cumsum_with_reset(value, 10),
group_10 = cumsum_with_reset_group(value, 10)
) %>%
ungroup()
This can be done easily with purrr::accumulate
library(dplyr)
library(purrr)
df %>% mutate(cumsum_10 = accumulate(value, ~ifelse(.x + .y <= 10, .x + .y, .y)),
group_10 = cumsum(value == cumsum_10))
id order value cumsum_10 group_10
1 8 1 4 4 1
2 13 2 5 9 1
3 7 3 7 7 2
4 1 4 3 10 2
5 4 5 8 8 3
6 10 6 1 9 3
7 12 7 2 2 4
8 2 8 5 7 4
9 15 9 3 10 4
10 11 10 6 6 5
11 14 11 2 8 5
12 3 12 6 6 6
13 5 13 3 9 6
14 9 14 1 10 6
15 6 15 4 4 7
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With