Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Aggregate with adjacent group if value falls below a threshold

Tags:

r

dplyr

I am trying to figure out a way to aggregate levels of a group creating a new level based on a threshold value of what you are aggregating.

Create some data:

library(tidyr)
library(dplyr)

demo_data <- as_tibble(VADeaths) %>% 
  mutate(age_bucket = row.names(VADeaths)) %>% 
  pivot_longer(-age_bucket) %>% 
  arrange(name)

Here are a bunch of values below our threshold (say 15 here)

demo_data %>% 
  filter(value < 15)
#> # A tibble: 5 x 3
#>   age_bucket name         value
#>   <chr>      <chr>        <dbl>
#> 1 50-54      Rural Female   8.7
#> 2 55-59      Rural Female  11.7
#> 3 50-54      Rural Male    11.7
#> 4 50-54      Urban Female   8.4
#> 5 55-59      Urban Female  13.6

Now I can use some logic to do this with case_when but this seems fragile because it is so specific. This does, however, illustrate what I am after:

demo_data %>% 
  mutate(age_bucket_agg = case_when(
    age_bucket %in% c("50-54", "55-59") & name == "Rural Female" ~ "50-59",
    age_bucket %in% c("50-54", "55-59") & name == "Urban Female" ~ "50-59",
    age_bucket %in% c("50-54", "55-59") & name == "Rural Male" ~ "50-59",
    TRUE ~ age_bucket
  )
  ) %>% 
  group_by(age_bucket_agg, name) %>% 
  summarise(value = sum(value))
#> `summarise()` regrouping output by 'age_bucket_agg' (override with `.groups` argument)
#> # A tibble: 17 x 3
#> # Groups:   age_bucket_agg [6]
#>    age_bucket_agg name         value
#>    <chr>          <chr>        <dbl>
#>  1 50-54          Urban Male    15.4
#>  2 50-59          Rural Female  20.4
#>  3 50-59          Rural Male    29.8
#>  4 50-59          Urban Female  22  
#>  5 55-59          Urban Male    24.3
#>  6 60-64          Rural Female  20.3
#>  7 60-64          Rural Male    26.9
#>  8 60-64          Urban Female  19.3
#>  9 60-64          Urban Male    37  
#> 10 65-69          Rural Female  30.9
#> 11 65-69          Rural Male    41  
#> 12 65-69          Urban Female  35.1
#> 13 65-69          Urban Male    54.6
#> 14 70-74          Rural Female  54.3
#> 15 70-74          Rural Male    66  
#> 16 70-74          Urban Female  50  
#> 17 70-74          Urban Male    71.1

My question is can anyone think of an automated way of doing this? How can I tell dplyr (or R in general) to take all values below as threshold and add them to the next age_bucket and then recode that grouping level to take the lowest value and the biggest value and create a new range.

like image 217
boshek Avatar asked Nov 07 '22 06:11

boshek


1 Answers

I think your example is a bit too minimal for this really challenging question. I added some challenges to your data which I think the approaches of the other answers can't tackle yet. My approach is quite verbose. Essentially, it checks every logical combination / direction in which age buckets could be merged and then recursively merges the age buckets until the threshold is met or until there are no other age buckets left to merge together. With a bit more work we could turn this into a more general function.

library(tidyverse)

demo_data <- as_tibble(VADeaths) %>% 
  mutate(age_bucket = row.names(VADeaths)) %>% 
  pivot_longer(-age_bucket) %>% 
  arrange(name) %>% 
  # lets add more challenges to the data
  mutate(value = case_when(
    age_bucket == "55-59" & name == "Rural Female" ~ 2,
    age_bucket == "70-74" & name == "Rural Male" ~ 13,
    age_bucket == "65-69" & name == "Urban Female" ~ 8,
    age_bucket == "70-74" & name == "Urban Male" ~ 3,
    T ~ value))

# function that implements merging age buckets
merge_impl <- function(x) {
  
  if(any(x$first)) {
    e <- filter(x, first == 1)
    
    if (e$id & !is.na(e$age_max_lead)) {
      out <- mutate(x,
                     age_max = if_else(first,
                                       age_max_lead,
                                       age_max),
                     value = if_else(first,
                                     value + value_lead,
                                     value)) 
      out <- filter(out, !lag(first, default = FALSE))

      
    } else if (e$id & is.na(e$age_max_lead & !is.na(e$age_min_lag))) {
      out <- mutate(x,
                     age_min = if_else(first,
                                       age_min_lag,
                                       age_min),
                     value = if_else(first,
                                     value + value_lag,
                                     value))
      out <- filter(out, !lead(first, default = FALSE))
      
    } else if (e$id & is.na(e$age_max_lead & is.na(e$age_min_lag))) {
      out <- x
    } else if (!e$id & !is.na(e$age_min_lag)) {
      out <- mutate(x,
                     age_min = if_else(first,
                                       age_min_lag,
                                       age_min),
                     value = if_else(first,
                                     value + value_lag,
                                     value)) 
      out <- filter(out, !lead(first, default = FALSE))

    } else if (!e$id & is.na(e$age_min_lag) & !is.na(e$age_max_lead)) {
      out <- mutate(x,
                     age_max = if_else(first,
                                       age_max_lead,
                                       age_max),
                     value = if_else(first,
                                     value + value_lead,
                                     value)) %>% 
        out <- filter(out, !lag(first, default = FALSE))

    } else if (!e$id & is.na(e$age_min_lag) & is.na(e$age_max_lead)) {
      out <- x
    }
  } else { 
    out <- x
  }

  select(out,
         -contains("lead"), -contains("lag"),
         -first, -id)
}

merge_age_buckets <- function(x, threshold) {
  
  # initialize
  data_ls <-
    x %>% 
    separate(age_bucket,
             c("age_min", "age_max"),
             convert = TRUE) %>% 
    group_by(name) %>% 
    mutate(across(c(age_min, age_max, value),
                    list(lead = ~ lead(.x),
                         lag  = ~ lag(.x))
                   )
    ) %>% 
    mutate(id = age_min %% 10 == 0,
           first = value < threshold & cumsum(value < threshold) == 1) %>% 
    group_split 

   # check & proceed
   if(any(map_lgl(data_ls, ~ any(.x$first & nrow(.x) > 1)))) {
     res <- map_dfr(data_ls, merge_impl) %>% 
       mutate(age_bucket = paste0(age_min, "-", age_max)) %>% 
       select(- c(age_min, age_max))
     # if result still needs adjustment repeat
     if(any(res$value < threshold)) {
       merge_age_buckets(res, threshold = threshold)
     } else {
       return(res)
       }
   } else {
     out <- reduce(data_ls, bind_rows) %>% 
       mutate(age_buckets = paste0(age_min, "-", age_max)) %>% 
       select(- c(age_min, age_max))
     return(out)
   }
}
 
merge_age_buckets(demo_data, 15)
#> # A tibble: 13 x 3
#>    name         value age_bucket
#>    <chr>        <dbl> <chr>     
#>  1 Rural Female  31   50-64     
#>  2 Rural Female  30.9 65-69     
#>  3 Rural Female  54.3 70-74     
#>  4 Rural Male    29.8 50-59     
#>  5 Rural Male    26.9 60-64     
#>  6 Rural Male    54   65-74     
#>  7 Urban Female  22   50-59     
#>  8 Urban Female  27.3 60-69     
#>  9 Urban Female  50   70-74     
#> 10 Urban Male    15.4 50-54     
#> 11 Urban Male    24.3 55-59     
#> 12 Urban Male    37   60-64     
#> 13 Urban Male    57.6 65-74

Created on 2020-06-23 by the reprex package (v0.3.0)

like image 111
TimTeaFan Avatar answered Nov 11 '22 09:11

TimTeaFan