I've tried to find a solution to this here, but nothing seems to address exactly my case. Sorry if I missed it.
I have a data frame with each row providing a position in one of various categories (e.g., row one corresponds position 2 within "category" 'A', but row 3 corresponds to position 4 within category 'B'). Each of those categories is to be split into a different set of tiles/intervals, and I would like to find a way to assign the positions within the original data frame into their corresponding tile/interval. For instance, given the input data and category breaks that follow:
library(tidyverse)
test_df <- tribble(
~category, ~pos,
'A', 2,
'A', 5,
'B', 4,
'B', 8
)
breaks <- tribble(
~category, ~start, ~end,
'A', 0, 4,
'A', 4, 7,
'A', 7, 10,
'B', 0, 3,
'B', 3, 5,
'B', 5, 10
)
The result I would like to obtain would be something like:
category pos tile
<chr> <dbl> <chr>
1 A 2 (0, 4]
2 A 5 (4, 7]
3 B 4 (3, 5]
4 B 8 (5, 10]
I would normally use cut
for similar tasks, but, as far as I'm aware, there's no way of defining different break points per group. The only way I have found to leverage group_by
to create distinct intervals with cut
, is by fixing the number of cuts to perform (which is not applicable in this case).
The best way I can come up to address my problem is this:
bind_rows(
lapply(
X=unique(test_df$category),
FUN=function(x) {
test_df %>%
filter(category==x) %>%
mutate(tile=cut(
pos,
breaks=c(0, filter(breaks, category==x)$end, Inf)))
} ) )
which provides the expected output, but doesn't feel elegant to me (and I am not sure how it would perform with literally millions of rows on the input).
Any suggestion on how to streamline it? Any way of keeping it "piped"?
Cheers,
Fran
one solution:
## rearrange "breaks"
breaks <-
breaks %>%
pivot_longer(cols = start:end) %>%
distinct(category, value) %>%
group_by(category) %>%
summarise(breaks = list(value))
## join and cut:
test_df %>%
left_join(breaks) %>%
rowwise %>%
mutate(tile = cut(pos, unlist(breaks))) %>%
ungroup ## reduce memory size of result object
output:
## # A tibble: 4 x 4
## # Rowwise:
## category pos breaks tile
## <chr> <dbl> <list> <fct>
## 1 A 2 <dbl [4]> (0,4]
## 2 A 5 <dbl [4]> (4,7]
## 3 B 4 <dbl [4]> (3,5]
## 4 B 8 <dbl [4]> (5,10]
edit
A faster (about 5 times) though less readable approach with a slightly smaller result object:
library(purrr)
## boil dataframe "breaks" down to a list of break vectors:
breaks_list <-
breaks %>%
## one tibble with columns 'start' and 'end' per 'category':
nest_by(category) %>%
## dataframe into list of tibbles, named with category
pull(data, category) %>%
## tibbles into vector of breaks
map(~ .x %>% as.matrix %>% c %>% unique %>% sort)
## get tile by indexing into "break_list" via mapping:
test_df %>%
mutate(
tile = map2(pos, category,
~ cut(.x, breaks_list[[.y]])
) %>% unlist
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With