Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

split data into chunks whilst respecting groups and threshold

Tags:

r

Let us say I have data like this:

haves <- data.frame(
        indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
    )

The indicator defines a group of a row and the rows are sorted by the indicator. The counts of rows for each group are as follows:

sum_stats <- haves %>% group_by(indicator) %>% summarise(n = n()) %>% ungroup()
sum_stats
# A tibble: 3 x 2
  indicator     n
      <dbl> <int>
1         1     2
2         2     3
3         3     8

Let us say we have a threshold of 5 and I would like to combine consecutive groups as long as the total number of rows is not breaching the threshold. Groups should also never be broken up. So the wants for this simplified example are:

wants <- data.frame(
        indicator = c(1, 1, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3)
        , group = c(1,1,1,1,1,2,2,2,2,2,2,2,2)
    )
wants
   indicator group
1          1     1
2          1     1
3          2     1
4          2     1
5          2     1
6          3     2
7          3     2
8          3     2
9          3     2
10         3     2
11         3     2
12         3     2
13         3     2

Is this possible? Any input would be very much appreciated. Thanks!

like image 682
cs0815 Avatar asked Mar 02 '23 12:03

cs0815


2 Answers

I think this might help you as I tried it on an extended data sample. Here are some notes on this solution:

  • I decided to use accumulate2 as we would like to see whether the sum of a vector with its previous value (lag value) is greater than 5
  • So the first vector would be n without the last element and the second would be again n without the first element
  • Here I also created a tibble, putting my new var grp into it on setting the first value of grp as 1 per first grouping value
  • It should be noted that while .init is supplied .x and .y (first and second vector) could be of the same length, otherwise the second one .y should be one shorter
  • Since accumulate2 takes a three-argument function for better differentiating of variables I used ..1, ..2 & ..3 as ..1 is the accumulated value starting from .init, ..2 the next value in sequence of first vector n[-nrow(.)] and ..3 the next value in sequence of the second vector n[-1]
  • Since ..1 is the accumulated/ previous value of grp, in case the sum of ..2 and ..3 is greater that 5 it remains as is, otherwise it will be added by 1 For further information you could check this page.
library(dplyr)
library(tidyr)
library(purrr)

haves <- data.frame(
  indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3, 4, 4, 5, 5, 6, 6, 6, 6, 6, 7, 8)
)

haves %>% 
  group_by(indicator) %>% 
  summarise(n = n()) %>% 
  ungroup() %>% 
  mutate(output = accumulate2(n[-nrow(.)], n[-1], .init = tibble(grp = 1), 
                           ~ tibble(grp = if(..2 + ..3 <= 5) {
                             ..1$grp
                           } else {
                             ..1$grp + 1
                           }))) %>%
  unnest(output)


# A tibble: 8 x 3
  indicator     n   grp
      <dbl> <int> <dbl>
1         1     2     1
2         2     3     1
3         3     8     2
4         4     2     3
5         5     2     3
6         6     5     4
7         7     1     5
8         8     1     5
like image 145
Anoushiravan R Avatar answered Mar 16 '23 00:03

Anoushiravan R


Dear Ronak has sometimes referred about library MESS, where is useful for its function cumsumbinning doing exactly as desired.

library(dplyr, warn.conflicts = FALSE)
library(tidyr, warn.conflicts = FALSE)
library(MESS)

haves <- data.frame(
  indicator = c(1,1,2,2,2,3,3,3,3,3,3,3,3)
)

haves %>%
  count(indicator) %>%
  mutate(xx = cumsumbinning(n, 5)) %>%
  uncount(n)
#>    indicator xx
#> 1          1  1
#> 2          1  1
#> 3          2  1
#> 4          2  1
#> 5          2  1
#> 6          3  2
#> 7          3  2
#> 8          3  2
#> 9          3  2
#> 10         3  2
#> 11         3  2
#> 12         3  2
#> 13         3  2

Created on 2021-07-10 by the reprex package (v2.0.0)

like image 38
AnilGoyal Avatar answered Mar 16 '23 00:03

AnilGoyal