Frequency-weighted percentile in dataframe with dplyr

I am trying to calculate the percentile rank of each value in a dataframe, weighted by an associated frequency column in the same dataframe. I'm struggling to come up with a solution that calculates the percentile of each value as if every value in the distribution were replicated as many times as its own frequency.

For example:

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 20,
  "banana",   2, 5,
  "carrot",   3, 1
)

groceries %>% 
    mutate(reg_ptile = percent_rank(price),
           wtd_ptile = weighted_percent_rank(price, wt = freq))

# the expected result would be:

# A tibble: 3 x 5
  item   price  freq reg_ptile wtd_ptile
  <chr>  <dbl> <dbl> <dbl>     <dbl>
1 apple      1    20  0.0      0.0
2 banana     2     5  0.5      0.8
3 carrot     3     1  1.0      1.0

percent_rank() is an actual dplyr function. How would the function weighted_percent_rank() be written? Not sure how to make this work in a dataframe and pipes. It would be swell if the solution could also work with groups.

Edit: Using uncount() doesn't really work because uncounting the data I'm using would result in 800 billion rows. Any other ideas?

Adhi R. asked Oct 12 '25 20:10


1 Answer

You can use tidyr::uncount to expand each row into as many rows as its frequency, compute the percentile rank on the expanded data, then collapse back to one row per item with summarize, as in this reprex:

library(dplyr)

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 10,
  "banana",   2, 5,
  "carrot",   3, 1
)

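groceries %>% 
  # repeat each row freq times (uncount() drops the freq column by default)
  tidyr::uncount(freq) %>% 
  # percent rank over the expanded rows is the frequency-weighted rank
  mutate(wtd_ptile = percent_rank(price)) %>%
  group_by(item) %>%
  # keep the first expanded row of each item to get back to one row per item
  summarize_all(~.[1]) %>%
  # unweighted percent rank, for comparison
  mutate(ptile = percent_rank(price))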
#> # A tibble: 3 x 4
#>   item   price wtd_ptile ptile
#>   <chr>  <dbl>     <dbl> <dbl>
#> 1 apple      1     0       0  
#> 2 banana     2     0.667   0.5
#> 3 carrot     3     1       1

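Note that there are different ranking functions you can choose from. In this example apple's freq is 10 rather than the 20 used in the question, so banana's weighted percentile is 0.667 ( 10/(16 - 1) ), not 0.8. With freq = 20 the same calculation gives 20/(26 - 1) = 0.8, which matches the expected result in the question.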
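
To see where 10/(16 - 1) comes from, here is a minimal base-R check of the expansion idea. It assumes that percent_rank() is (minimum rank - 1)/(n - 1); pct_rank is a hypothetical stand-in written for this check, not a dplyr function.

```r
# Hypothetical base-R stand-in for dplyr::percent_rank():
# (minimum rank - 1) / (number of values - 1)
pct_rank <- function(x) (rank(x, ties.method = "min") - 1) / (length(x) - 1)

price <- c(1, 2, 3)
freq  <- c(10, 5, 1)

# Expand to 16 values: ten 1s, five 2s, one 3
expanded <- rep(price, times = freq)
ranks <- pct_rank(expanded)

# Take the first expanded row of each distinct price
wtd <- ranks[match(price, expanded)]
round(wtd, 3)  # 0, 0.667, 1
```

Banana's first expanded row has ten rows below it out of 16 - 1 = 15 other rows, hence 10/15.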


EDIT

An alternative that does not involve creating billions of rows:

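groceries %>% 
  arrange(price) %>%   # sort ascending so cumsum() runs from the cheapest item up
  # total frequency strictly below each row, divided by (N - 1) as in percent_rank();
  # assumes each price appears on only one row
  mutate(wtd_ptile = lag(cumsum(freq), default = 0)/(sum(freq) - 1))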
#> # A tibble: 3 x 4
#>   item   price  freq wtd_ptile
#>   <chr>  <dbl> <dbl>     <dbl>
#> 1 apple      1    10     0    
#> 2 banana     2     5     0.667
#> 3 carrot     3     1     1  
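
The question asked for a weighted_percent_rank() function that can be called inside mutate(). Below is a rough sketch of one way to wrap the cumulative-sum idea above into such a function. It is an assumption-laden illustration, not a dplyr function: the name and the wt argument are taken from the question, it is written in base R, it treats tied values the way percent_rank() does (minimum rank), and it leaves NA in x as NA.

```r
# Hypothetical helper (not part of dplyr): frequency-weighted percent rank.
# For each x, returns (total weight of values strictly below x) / (total weight - 1),
# which is what percent_rank() would give if every x were repeated wt times.
weighted_percent_rank <- function(x, wt) {
  stopifnot(length(x) == length(wt))

  vals <- sort(unique(x))          # distinct values, ascending (NA dropped)
  idx  <- match(x, vals)           # position of each x among the distinct values

  # Total weight per distinct value, in the same ascending order as vals
  w_val <- as.vector(tapply(wt, idx, sum))

  # Weight strictly below each distinct value
  below <- cumsum(w_val) - w_val

  total <- sum(wt[!is.na(idx)])    # ignore weights attached to missing x
  below[idx] / (total - 1)
}

# Data from the question (apple has freq 20)
weighted_percent_rank(c(1, 2, 3), wt = c(20, 5, 1))

# Tied prices share the same rank; row order does not matter
weighted_percent_rank(c(2, 1, 2), wt = c(1, 2, 3))
```

Because it takes plain vectors and returns a vector of the same length, it should slot into a pipeline as mutate(wtd_ptile = weighted_percent_rank(price, wt = freq)), and after a group_by() it would be evaluated separately within each group. No rows are expanded, so the size of the frequencies does not affect memory use.
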
Allan Cameron answered Oct 14 '25 15:10