
summarise wtd.quantile by group

Tags: r, dplyr, plyr, hmisc

I want to build a new data frame of weighted quantiles with Hmisc::wtd.quantile() from a data frame that has many repeated dates. I group by date, aggregate with summarize(), and try to call wtd.quantile() (with weights) on each date. The dataset is also fairly large. Sample code is below:

# sample data
# grouping_var = dt_time
require(Hmisc)
require(plyr)
require(dplyr)
df <- data.frame(type = sample(letters[1:2], 10e6, replace = TRUE), 
             score = sample(500:899, 10e6, replace = TRUE),
             dt_time = sample(seq(as.Date('2010/01/01'), 
                                  as.Date('2018/01/01'), 
                                  by="day"), 10e6, replace = TRUE),
             weight = sample(1.0:2.0, 10e6, replace = TRUE))
# my attempt:
ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))

# desired df,
# where each new variable would be created using the
# wtd.quantile function:
desired_ptiles <- data.frame(dt_time = seq(as.Date('2010/01/01'),
                                       as.Date('2010/01/06'),
                                       by = "day"),
                         # only 6 because lol 10e6
                         ptile10 = sample(500:899, 6, replace = TRUE),
                         ptile50 = sample(500:899, 6, replace = TRUE),
                         ptile90 = sample(500:899, 6, replace = TRUE))

So far my efforts have resulted in this error:

Error in summarise_impl(.data, dots) :
Evaluation error: 'arg' must be NULL or a character vector.

And when using formula notation:

ptiles <- df %>%
  group_by(dt_time) %>%
  plyr::ddply(~dt_time, dplyr::summarize,
              ptile10 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .1, na.rm = TRUE),
              ptile50 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .5, na.rm = TRUE),
              ptile90 = Hmisc::wtd.quantile(., .$score, weights = .$weight,
                                            probs = .9, na.rm = TRUE))
# error message:
Error in summarise_impl(.data, dots) : 
  Evaluation error: 'arg' must be NULL or a character vector.

Am I approaching this in the wrong way? I've seen methods that use split() but that seems annoying. Is there a data.table method that will allow wtd.quantile() to be summarized in this way?

Thanks!

Francisco asked Oct 17 '25 01:10


1 Answer

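You don't need ddply when using group_by, because the data is already split by the grouping. You also don't need to pass the data (. or .$column) inside summarize after grouping; refer to the columns by name. The error most likely comes from argument matching: with weights and probs named, the unnamed .$score falls through to wtd.quantile's type argument, which has to be a character string, hence "'arg' must be NULL or a character vector".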

This works:

ptiles <- df %>%
  group_by(dt_time) %>%
  summarize(ptile10 = wtd.quantile(score, weights = weight,
                                   probs = .1, na.rm = TRUE),
            ptile50 = wtd.quantile(score, weights = weight,
                                   probs = .5, na.rm = TRUE),
            ptile90 = wtd.quantile(score, weights = weight,
                                   probs = .9, na.rm = TRUE))

> ptiles
# A tibble: 2,923 x 4
      dt_time ptile10 ptile50 ptile90
       <date>   <dbl>   <dbl>   <dbl>
 1 2010-01-01   539.0     697   859.0
 2 2010-01-02   538.0     704   861.7
 3 2010-01-03   541.0     706   862.0
 4 2010-01-04   541.0     702   859.0
 5 2010-01-05   540.0     706   860.0
 6 2010-01-06   537.0     695   859.0
 7 2010-01-07   539.0     696   859.0
 8 2010-01-08   536.0     700   857.0
 9 2010-01-09   538.0     694   861.0
10 2010-01-10   538.4     701   859.0
# ... with 2,913 more rows
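
On the data.table part of the question: here is a minimal sketch, assuming the data.table package is installed, that computes the same per-date weighted quantiles with one wtd.quantile() call per group instead of three. The small simulated table and the names scores_dt and ptiles_dt are illustrative only and are not from the original post.

```r
library(data.table)
library(Hmisc)

# Small simulated table (hypothetical stand-in for the 10e6-row df)
set.seed(1)
n <- 1e4
scores_dt <- data.table(
  score   = sample(500:899, n, replace = TRUE),
  dt_time = sample(seq(as.Date("2010-01-01"), as.Date("2010-01-10"), by = "day"),
                   n, replace = TRUE),
  weight  = sample(1:2, n, replace = TRUE)
)

# One wtd.quantile() call per date returns all three quantiles as a
# named vector; as.list() turns them into columns, setNames() renames
# them, and keyby sorts the result by dt_time.
ptiles_dt <- scores_dt[
  ,
  setNames(
    as.list(wtd.quantile(score, weights = weight,
                         probs = c(.1, .5, .9), na.rm = TRUE)),
    c("ptile10", "ptile50", "ptile90")
  ),
  keyby = dt_time
]

print(ptiles_dt)
```

The same pattern should apply to the full data after converting it with as.data.table(df). Whether it is faster than the dplyr version on 10e6 rows would need to be measured.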
LAP answered Oct 18 '25 17:10