
dplyr summarize: create variables from named vector

Tags:

r

dplyr

Here's my problem:

I am using a function that returns a named vector. Here's a toy example:

toy_fn <- function(x) {
    y <- c(mean(x), sum(x), median(x), sd(x))
    names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
    y
}

I am using group_by in dplyr to apply this function for each group (typical split-apply-combine). So, here's my toy data.frame:

set.seed(1234567)
toy_df <- data.frame(id = 1:1000, 
                     group = sample(letters, 1000, replace = TRUE), 
                     value = runif(1000))

And here's the result I am aiming for:

toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(Right = toy_fn(value)["Right"], 
              Wrong = toy_fn(value)["Wrong"], 
              Unanswered = toy_fn(value)["Unanswered"], 
              Invalid = toy_fn(value)["Invalid"])

> toy_summary
Source: local data frame [26 x 5]

   group     Right    Wrong Unanswered   Invalid
1      a 0.5038394 20.15358  0.5905526 0.2846468
2      b 0.5048040 15.64892  0.5163702 0.2994544
3      c 0.5029442 21.62660  0.5072733 0.2465612
4      d 0.5124601 14.86134  0.5382463 0.2681955
5      e 0.4649483 17.66804  0.4426197 0.3075080
6      f 0.5622644 12.36982  0.6330269 0.2850609
7      g 0.4675324 14.96104  0.4692404 0.2746589

It works, but calling the same function four times per group is wasteful and repetitive. I would rather have dplyr take the named vector and create one new variable per element. Something like this:

toy_summary <- 
    toy_df %>% 
    group_by(group) %>% 
    summarize(toy_fn(value))

Unfortunately, this fails with "Error: expecting a single value".

I thought, OK, let's just convert the vector to a data.frame using data.frame(as.list(x)). But this does not work either. I tried many things, but I couldn't trick dplyr into thinking it is actually receiving a single value (one observation) for each of 4 different variables. Is there any way to help dplyr realize that?

Hernando Casas asked May 25 '15 15:05


4 Answers

One possible solution is to use dplyr's standard evaluation (SE) capabilities. For example, define the summary expressions as a named list of formulas, as follows

dots <- setNames(list(~ mean(value),
                      ~ sum(value),
                      ~ median(value),
                      ~ sd(value)),
                 c("Right", "Wrong", "Unanswered", "Invalid"))

Then, you can use summarize_ (with a _) as follows

toy_df %>% 
  group_by(group) %>% 
  summarize_(.dots = dots)
# Source: local data table [26 x 5]
# 
#    group     Right    Wrong Unanswered   Invalid
# 1      o 0.4490776 17.51403  0.4012057 0.2749956
# 2      s 0.5079569 15.23871  0.4663852 0.2555774
# 3      x 0.4620649 14.78608  0.4475117 0.2894502
# 4      a 0.5038394 20.15358  0.5905526 0.2846468
# 5      t 0.5041168 24.19761  0.5330790 0.3171022
# 6      m 0.4806628 21.14917  0.4805273 0.2825026
# 7      c 0.5029442 21.62660  0.5072733 0.2465612
# 8      w 0.4932484 17.75694  0.4891746 0.3309680
# 9      q 0.5350707 22.47297  0.5608505 0.2749941
# 10     g 0.4675324 14.96104  0.4692404 0.2746589
# ..   ...       ...      ...        ...       ...

Though it looks nice, there is a big catch here. The column you are going to operate on (value) is hard-coded into dots, so you have to know it a priori. To summarize a different column, you have to rebuild dots with that column's name.
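
One way around the hard-coded column is to build the formulas from a column name. The sketch below uses only base R; make_dots is a hypothetical helper name, not part of dplyr, and the mapping of output names to functions is copied from the toy example. Note that summarize_() has been deprecated since dplyr 0.7.0, so the call in the final comment is an assumption about an older (or still tolerant) dplyr version.

```r
# Hypothetical helper: build a named list of one-sided formulas for any column.
make_dots <- function(col) {
  # Output column name -> summary function name (taken from the toy example)
  fns <- c(Right = "mean", Wrong = "sum", Unanswered = "median", Invalid = "sd")
  # lapply() keeps the names, so the result is a named list of formulas
  lapply(fns, function(f) as.formula(paste0("~ ", f, "(", col, ")")))
}

dots_value <- make_dots("value")

# Assumed usage with the SE verb from the answer:
# toy_df %>% group_by(group) %>% summarize_(.dots = make_dots("value"))
```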


As a bonus, here's a simple data.table solution that uses your original function unchanged

library(data.table)
setDT(toy_df)[, as.list(toy_fn(value)), by = group]
#     group     Right    Wrong Unanswered   Invalid
#  1:     o 0.4490776 17.51403  0.4012057 0.2749956
#  2:     s 0.5079569 15.23871  0.4663852 0.2555774
#  3:     x 0.4620649 14.78608  0.4475117 0.2894502
#  4:     a 0.5038394 20.15358  0.5905526 0.2846468
#  5:     t 0.5041168 24.19761  0.5330790 0.3171022
#  6:     m 0.4806628 21.14917  0.4805273 0.2825026
#  7:     c 0.5029442 21.62660  0.5072733 0.2465612
#  8:     w 0.4932484 17.75694  0.4891746 0.3309680
#  9:     q 0.5350707 22.47297  0.5608505 0.2749941
# 10:     g 0.4675324 14.96104  0.4692404 0.2746589
#...
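
Two details of the call above are worth spelling out: setDT() converts toy_df to a data.table by reference (the original object is modified), and the column name is again fixed. A minimal sketch that avoids both, assuming data.table is installed, follows; summarize_by and by_col are hypothetical names introduced only for illustration.

```r
library(data.table)

toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Hypothetical wrapper: the column to summarize is passed as a string.
summarize_by <- function(df, col, by_col = "group") {
  dt <- as.data.table(df)  # works on a copy, so df stays a plain data.frame
  # .SDcols restricts .SD to `col`; as.list() turns the named vector
  # into one column per element
  dt[, as.list(toy_fn(.SD[[1]])), by = by_col, .SDcols = col]
}

toy_summary_dt <- summarize_by(toy_df, "value")
```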
David Arenburg answered Oct 23 '22 18:10


You can also try this with do(), which stores each group's named vector in a list-column called res:

toy_df %>%
  group_by(group) %>%
  do(res = toy_fn(.$value))
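
To get one column per element rather than a list-column, the vector can be turned into a one-row data frame inside do(). The sketch below shows that, plus a second variant that assumes dplyr >= 1.0.0, where summarise() accepts an unnamed data-frame result and unpacks it into columns. The names via_do and via_summarise are only illustrative.

```r
library(dplyr)

toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Variant 1: do() with an unnamed expression returning a one-row data frame
via_do <- toy_df %>%
  group_by(group) %>%
  do(data.frame(as.list(toy_fn(.$value)))) %>%
  ungroup()

# Variant 2 (assumes dplyr >= 1.0.0): summarise() unpacks an unnamed data frame
via_summarise <- toy_df %>%
  group_by(group) %>%
  summarise(data.frame(as.list(toy_fn(value))))
```

Either way, toy_fn is called once per group, which is what the question asked for.
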
Josh W. answered Oct 23 '22 18:10


This is not a dplyr solution, but if you like pipes:

library(magrittr)

toy_summary <-
  toy_df %>% 
  split(.$group) %>% 
  lapply( function(x) toy_fn(x$value) ) %>% 
  do.call(rbind, .)

# > head(toy_summary)
#         Right    Wrong Unanswered   Invalid
#   a 0.5038394 20.15358  0.5905526 0.2846468
#   b 0.5048040 15.64892  0.5163702 0.2994544
#   c 0.5029442 21.62660  0.5072733 0.2465612
#   d 0.5124601 14.86134  0.5382463 0.2681955
#   e 0.4649483 17.66804  0.4426197 0.3075080
#   f 0.5622644 12.36982  0.6330269 0.2850609      
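
Note that do.call(rbind, .) returns a numeric matrix with the group labels as row names, not a data frame. If a data frame with an explicit group column is wanted, a base R sketch along these lines should do it (mat and toy_summary_df are illustrative names):

```r
toy_fn <- function(x) {
  y <- c(mean(x), sum(x), median(x), sd(x))
  names(y) <- c("Right", "Wrong", "Unanswered", "Invalid")
  y
}

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Same split-apply-combine as above, without pipes
mat <- do.call(rbind, lapply(split(toy_df, toy_df$group),
                             function(x) toy_fn(x$value)))

# Move the row names into a regular column and reset the row names
toy_summary_df <- data.frame(group = rownames(mat), mat)
rownames(toy_summary_df) <- NULL
```
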
bergant answered Oct 23 '22 17:10


Apparently there's a problem when using median (not sure what's going on there) but apart from that you can normally use an approach like the following with summarise_each to apply multiple functions. Note that you can specify the names of resulting columns by using a named vector as input to funs_():

x <- c(Right = "mean", Wrong = "sd", Unanswered = "sum")

toy_df %>% 
  group_by(group) %>% 
  summarise_each(funs_(x), value)

#Source: local data frame [26 x 4]
#
#   group     Right     Wrong Unanswered
#1      a 0.5038394 0.2846468   20.15358
#2      b 0.5048040 0.2994544   15.64892
#3      c 0.5029442 0.2465612   21.62660
#4      d 0.5124601 0.2681955   14.86134
#5      e 0.4649483 0.3075080   17.66804
#6      f 0.5622644 0.2850609   12.36982
#7      g 0.4675324 0.2746589   14.96104
#8      h 0.4921506 0.2879830   21.16248
#9      i 0.5443600 0.2945428   22.31876
#10     j 0.5276048 0.3236814   20.57659
#..   ...       ...       ...        ...
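
For readers on a newer dplyr, the same idea can be sketched with across(), assuming dplyr >= 1.0.0; summarise_each() and funs_() belong to the older API used above. The named list fns is an illustrative name, and the .names = "{.fn}" specification makes the output columns take the list names only.

```r
library(dplyr)

set.seed(1234567)
toy_df <- data.frame(id = 1:1000,
                     group = sample(letters, 1000, replace = TRUE),
                     value = runif(1000))

# Output column name -> function, mirroring the named vector above
fns <- list(Right = mean, Wrong = sd, Unanswered = sum)

toy_summary_across <- toy_df %>%
  group_by(group) %>%
  summarise(across(value, fns, .names = "{.fn}"))
```
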
talat answered Oct 23 '22 18:10