split data into chunks whilst respecting groups and threshold

Question

Let us say I have data like this:

haves <- data.frame(
        indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
    )

The indicator defines a group of a row and the rows are sorted by the indicator. The counts of rows for each group are as follows:

sum_stats <- haves %>% group_by(indicator) %>% summarise(n = n()) %>% ungroup()
sum_stats
# A tibble: 3 x 2
  indicator     n
      <dbl> <int>
1         1     2
2         2     3
3         3     8

Let us say we have a threshold of 5 and I would like to combine consecutive groups as long as the total number of rows is not breaching the threshold. Groups should also never be broken up. So the wants for this simplified example are:

wants <- data.frame(
        indicator = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
        , group = c(1,1,1,1,1,2,2,2,2,2,2,2,2)
    )
wants
   indicator group
1          1     1
2          1     1
3          2     1
4          2     1
5          2     1
6          3     2
7          3     2
8          3     2
9          3     2
10         3     2
11         3     2
12         3     2
13         3     2

Is this possible? Any input would be very much appreciated. Thanks!

Anoushiravan R · Accepted Answer

I think this might help you as I tried it on an extended data sample. Here are some notes on this solution:

I decided to use accumulate2 as we would like to see whether the sum of a vector with its previous value (lag value) is greater than 5
So the first vector would be n without the last element and the second would be again n without the first element
Here I also created a tibble, putting my new var grp into it on setting the first value of grp as 1 per first grouping value
It should be noted that while .init is supplied .x and .y (first and second vector) could be of the same length, otherwise the second one .y should be one shorter
Since accumulate2 takes a three-argument function for better differentiating of variables I used ..1, ..2 & ..3 as ..1 is the accumulated value starting from .init, ..2 the next value in sequence of first vector n[-nrow(.)] and ..3 the next value in sequence of the second vector n[-1]
Since ..1 is the accumulated/ previous value of grp, in case the sum of ..2 and ..3 is greater that 5 it remains as is, otherwise it will be added by 1 For further information you could check this page.

library(dplyr)
library(tidyr)
library(purrr)

haves <- data.frame(
  indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8)
)

haves %>% 
  group_by(indicator) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(output = accumulate2(n[-nrow(.)], n[-1], .init = tibble(grp = 1), 
                           ~ tibble(grp = if(..2 + ..3 <= 5) {
                             ..1$grp
                           } else {
                             ..1$grp + 1
                           }))) %>%
  unnest(output)


# A tibble: 8 x 3
  indicator     n   grp
      <dbl> <int> <dbl>
1         1     2     1
2         2     3     1
3         3     8     2
4         4     2     3
5         5     2     3
6         6     5     4
7         7     1     5
8         8     1     5

AnilGoyal · Answer

Dear Ronak has sometimes referred about library MESS, where is useful for its function cumsumbinning doing exactly as desired.

library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(MESS)

haves <- data.frame(
  indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
)

haves %>%
  count(indicator) %>%
  mutate(xx = cumsumbinning(n, 5)) %>%
  uncount(n)
#>    indicator xx
#> 1          1  1
#> 2          1  1
#> 3          2  1
#> 4          2  1
#> 5          2  1
#> 6          3  2
#> 7          3  2
#> 8          3  2
#> 9          3  2
#> 10         3  2
#> 11         3  2
#> 12         3  2
#> 13         3  2

^{Created on 2021-07-10 by the reprex package (v2.0.0)}

split data into chunks whilst respecting groups and threshold

Tags:

r

cs0815

2 Answers

Anoushiravan R

AnilGoyal

Recent Activity

Donate For Us

split data into chunks whilst respecting groups and threshold

Tags:

r

cs0815

2 Answers

Anoushiravan R

AnilGoyal

Related questions

Recent Activity

Donate For Us