Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Cut numeric variable in data frame with different breaks according to group (preferably with dplyr)

I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.

I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:

library(tidyverse)

test_df <- tribble(
  ~category, ~pos,
  'A', 2,
  'A', 5,
  'B', 4,
  'B', 8
)

breaks <- tribble(
  ~category, ~start, ~end,
  'A', 0, 4,
  'A', 4, 7,
  'A', 7, 10,
  'B', 0, 3,
  'B', 3, 5,
  'B', 5, 10
)

The result I would like to obtain would be something like:

  category   pos tile   
  <chr>    <dbl> <chr>  
1 A            2 (0, 4] 
2 A            5 (4, 7] 
3 B            4 (3, 5] 
4 B            8 (5, 10]

I would normally use cut for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by to create distinct intervals with cut, is by fixing the number of cuts to perform (which is not applicable in this case).

The best way I can come up to address my problem is this:

bind_rows(
  lapply(
    X=unique(test_df$category),
    FUN=function(x) {
      test_df %>%
        filter(category==x) %>%
        mutate(tile=cut(
          pos,
          breaks=c(0, filter(breaks, category==x)$end, Inf)))
    } ) )

which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).

Any suggestion on how to streamline it? Any way of keeping it "piped"?

Cheers,

Fran

like image 956
Francisco Rodríguez Algarra Avatar asked Sep 01 '25 11:09

Francisco Rodríguez Algarra


1 Answers

one solution:

## rearrange "breaks"
breaks <- 
  breaks %>%
  pivot_longer(cols = start:end) %>%
  distinct(category, value) %>% 
  group_by(category) %>%
  summarise(breaks =  list(value))

## join and cut:
test_df %>%
  left_join(breaks) %>%
  rowwise %>%
  mutate(tile = cut(pos, unlist(breaks))) %>%
  ungroup ## reduce memory size of result object

output:


## # A tibble: 4 x 4
## # Rowwise: 
##   category   pos breaks    tile  
##   <chr>    <dbl> <list>    <fct> 
## 1 A            2 <dbl [4]> (0,4] 
## 2 A            5 <dbl [4]> (4,7] 
## 3 B            4 <dbl [4]> (3,5] 
## 4 B            8 <dbl [4]> (5,10]

edit

A faster (about 5 times) though less readable approach with a slightly smaller result object:

library(purrr)

## boil dataframe "breaks" down to a list of break vectors:

breaks_list <- 
  breaks %>%
  ## one tibble with columns 'start' and 'end' per 'category':
  nest_by(category) %>%
  ## dataframe into list of tibbles, named with category
  pull(data, category) %>%
  ## tibbles into vector of breaks
  map(~ .x %>% as.matrix %>% c %>% unique %>% sort) 

## get tile by indexing into "break_list" via mapping: 
test_df %>% 
mutate(
  tile = map2(pos, category,
              ~ cut(.x, breaks_list[[.y]])
              ) %>% unlist
)

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!