Frequency-weighted percentile in dataframe with dplyr

Question

I am trying to calculate the percentile rank of each value in a dataframe, weighted by an associated frequency column. I'm struggling to come up with a solution that gives each value the percentile it would have if every row in the dataframe were replicated as many times as its frequency.

For example:

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 20,
  "banana",   2, 5,
  "carrot",   3, 1
)

groceries %>% 
    mutate(reg_ptile = percent_rank(price),
           wtd_ptile = weighted_percent_rank(price, wt = freq))

# the expected result would be:

# A tibble: 3 x 5
  item   price  freq reg_ptile wtd_ptile
  <chr>  <dbl> <dbl> <dbl>     <dbl>
1 apple      1    20  0.0      0.0
2 banana     2     5  0.5      0.8
3 carrot     3     1  1.0      1.0

percent_rank() is an actual dplyr function. How would the function weighted_percent_rank() be written? Not sure how to make this work in a dataframe and pipes. It would be swell if the solution could also work with groups.

Edit: Using uncount() doesn't really work because uncounting the data I'm using would result in 800 billion rows. Any other ideas?

Allan Cameron · Accepted Answer

You can use tidyr::uncount to expand each row according to its frequency, compute the percentile on the expanded data, then reduce back down to one row per item with summarize, as in this example:

library(dplyr)

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 10,
  "banana",   2, 5,
  "carrot",   3, 1
)

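groceries %>% 
  tidyr::uncount(freq) %>%                     # one row per unit of frequency (drops freq)
  mutate(wtd_ptile = percent_rank(price)) %>%  # percentile on the expanded data
  group_by(item) %>%
  summarize_all(~.[1]) %>%                     # collapse back to one row per item
  mutate(ptile = percent_rank(price))          # unweighted percentile, for comparison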
#> # A tibble: 3 x 4
#>   item   price wtd_ptile ptile
#>   <chr>  <dbl>     <dbl> <dbl>
#> 1 apple      1     0       0  
#> 2 banana     2     0.667   0.5
#> 3 carrot     3     1       1

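Note there are different ranking functions you can choose. With the frequencies used here (10, 5 and 1), the weighted percentile for banana is 0.667, i.e. 10/(16 - 1), not 0.8. A value of 0.8 corresponds to an apple frequency of 20, i.e. 20/(26 - 1).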
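To make the arithmetic explicit, here is a small base R check. It is only a sketch: it assumes the prices are already sorted in ascending order and are all distinct, and the helper name `below` is made up for this illustration. dplyr documents percent_rank() as the number of values less than the current one divided by the number of observations minus 1, so on the expanded data that becomes (expanded rows with a lower price) / (total expanded rows - 1).

```r
price <- c(1, 2, 3)           # apple, banana, carrot (sorted, no ties)

freq_answer   <- c(10, 5, 1)  # frequencies used in this answer
freq_question <- c(20, 5, 1)  # frequencies used in the question

# Hypothetical helper: number of expanded rows priced strictly below each item
below <- function(freq) c(0, head(cumsum(freq), -1))

ptile_answer   <- below(freq_answer)   / (sum(freq_answer)   - 1)
ptile_question <- below(freq_question) / (sum(freq_question) - 1)

ptile_answer    # 0, 10/15 (about 0.667), 1
ptile_question  # 0, 20/25 (0.8), 1
```

So both figures are consistent with the same definition; they differ only because the apple frequency differs.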

EDIT

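An alternative that does not involve creating billions of rows: sort by price, then divide the cumulative frequency of all cheaper items by the total frequency minus one. (As written, this assumes each price appears only once.)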

groceries %>% 
  arrange(price) %>% 
  mutate(wtd_ptile = lag(cumsum(freq), default = 0)/(sum(freq) - 1))
#> # A tibble: 3 x 4
#>   item   price  freq wtd_ptile
#>   <chr>  <dbl> <dbl>     <dbl>
#> 1 apple      1    10     0    
#> 2 banana     2     5     0.667
#> 3 carrot     3     1     1
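The same idea can be wrapped in a function with the signature the question asks for. This is a minimal sketch, not a dplyr function: `weighted_percent_rank()` is the hypothetical name from the question, it assumes `x` and `wt` contain no NA values, and it gives tied values the same percentile (mirroring what percent_rank() would do on the expanded data).

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): weighted analogue of percent_rank().
# For each element, sums the weights of all elements with a strictly smaller
# value and divides by (total weight - 1). Assumes no NA in x or wt.
weighted_percent_rank <- function(x, wt) {
  ord <- order(x)
  x_sorted <- x[ord]
  # weight that comes before each element in sorted order
  below_sorted <- cumsum(wt[ord]) - wt[ord]
  # tied values share the smallest "below" weight of their tie group
  below_sorted <- ave(below_sorted, x_sorted, FUN = min)
  # put the results back in the original row order
  below <- numeric(length(x))
  below[ord] <- below_sorted
  below / (sum(wt) - 1)
}

groceries <- tribble(
  ~item, ~price, ~freq,
  "apple",   1, 10,
  "banana",  2, 5,
  "carrot",  3, 1
)

result <- groceries %>%
  mutate(reg_ptile = percent_rank(price),
         wtd_ptile = weighted_percent_rank(price, wt = freq))
# wtd_ptile is 0, 10/15 (about 0.667), 1
```

Because mutate() evaluates its expressions once per group, calling group_by() before the mutate() should give percentiles within each group. No rows need to be sorted beforehand, since the function orders the values internally. If the weights sum to 1, the denominator is zero and the result is NaN, as with percent_rank() on a single value.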

Tags: r, dplyr, statistics

Adhi R.
