Let us say I have data like this:
haves <- data.frame(
indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
)
The indicator defines a group of a row and the rows are sorted by the indicator. The counts of rows for each group are as follows:
sum_stats <- haves %>% group_by(indicator) %>% summarise(n = n()) %>% ungroup()
sum_stats
# A tibble: 3 x 2
indicator n
<dbl> <int>
1 1 2
2 2 3
3 3 8
Let us say we have a threshold of 5 and I would like to combine consecutive groups as long as the total number of rows is not breaching the threshold. Groups should also never be broken up. So the wants for this simplified example are:
wants <- data.frame(
indicator = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
, group = c(1,1,1,1,1,2,2,2,2,2,2,2,2)
)
wants
indicator group
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 2
7 3 2
8 3 2
9 3 2
10 3 2
11 3 2
12 3 2
13 3 2
Is this possible? Any input would be very much appreciated. Thanks!
I think this might help you as I tried it on an extended data sample. Here are some notes on this solution:
accumulate2
as we would like to see whether the sum of a vector with its previous value (lag value) is greater than 5n
without the last element and the second would be again n
without the first elementtibble
, putting my new var grp
into it on setting the first value of grp
as 1
per first grouping value.init
is supplied .x
and .y
(first and second vector) could be of the same length, otherwise the second one .y
should be one shorteraccumulate2
takes a three-argument function for better differentiating of variables I used ..1
, ..2
& ..3
as ..1
is the accumulated value starting from .init
, ..2
the next value in sequence of first vector n[-nrow(.)]
and ..3
the next value in sequence of the second vector n[-1]
..1
is the accumulated/ previous value of grp
, in case the sum of ..2
and ..3
is greater that 5
it remains as is, otherwise it will be added by 1
For further information you could check this page.library(dplyr)
library(tidyr)
library(purrr)
haves <- data.frame(
indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8)
)
haves %>%
group_by(indicator) %>%
summarise(n = n()) %>%
ungroup() %>%
mutate(output = accumulate2(n[-nrow(.)], n[-1], .init = tibble(grp = 1),
~ tibble(grp = if(..2 + ..3 <= 5) {
..1$grp
} else {
..1$grp + 1
}))) %>%
unnest(output)
# A tibble: 8 x 3
indicator n grp
<dbl> <int> <dbl>
1 1 2 1
2 2 3 1
3 3 8 2
4 4 2 3
5 5 2 3
6 6 5 4
7 7 1 5
8 8 1 5
Dear Ronak has sometimes referred about library MESS
, where is useful for its function cumsumbinning
doing exactly as desired.
library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(MESS)
haves <- data.frame(
indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
)
haves %>%
count(indicator) %>%
mutate(xx = cumsumbinning(n, 5)) %>%
uncount(n)
#> indicator xx
#> 1 1 1
#> 2 1 1
#> 3 2 1
#> 4 2 1
#> 5 2 1
#> 6 3 2
#> 7 3 2
#> 8 3 2
#> 9 3 2
#> 10 3 2
#> 11 3 2
#> 12 3 2
#> 13 3 2
Created on 2021-07-10 by the reprex package (v2.0.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With