dplyr summarise() with multiple return values from a single function

Tags:

dplyr

I am wondering if there is a way to use functions with summarise (dplyr 0.1.2) that return multiple values (for instance the describe function from psych package).

If not, is it just because it hasn't been implemented yet, or is there a reason that it wouldn't be a good idea?

Example:

    require(psych)
    require(ggplot2)
    require(dplyr)

    dgrp <- group_by(diamonds, cut)
    describe(dgrp$price)
    summarise(dgrp, describe(price))

produces: Error: expecting a single value


asked Mar 07 '14 03:03

2 Answers

With dplyr >= 0.2 we can use the do() function for this:

    library(ggplot2)
    library(psych)
    library(dplyr)

    diamonds %>%
        group_by(cut) %>%
        do(describe(.$price)) %>%
        select(-vars)
    #> Source: local data frame [5 x 13]
    #> Groups: cut [5]
    #>
    #>         cut     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
    #>      (fctr) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
    #> 1      Fair  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
    #> 2      Good  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
    #> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
    #> 4   Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
    #> 5     Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
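As a side note, do() is marked as superseded in later dplyr releases, and group_modify() covers the same per-group pattern there. Here is a minimal sketch of that variant. It assumes a dplyr version that ships group_modify(), and it uses the built-in mtcars data plus a hypothetical helper, describe_min(), as a stand-in for psych::describe() so that it runs without extra packages:

```r
library(dplyr)

# Hypothetical stand-in for psych::describe():
# returns several statistics as a one-row data frame
describe_min <- function(x) {
  data.frame(n = length(x), mean = mean(x), sd = sd(x), median = median(x))
}

res <- mtcars %>%
  group_by(cyl) %>%
  # .x is the data for the current group (without the grouping column);
  # the function must return a data frame
  group_modify(~ describe_min(.x$mpg)) %>%
  ungroup()

res
```

The result has one row per cyl group, with the grouping column followed by the columns returned by the helper.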

A solution based on the purrr package (these functions moved to purrrlyr in 2017):

    library(ggplot2)
    library(psych)
    library(purrr)  # with a current purrr, load library(purrrlyr) instead

    diamonds %>%
        slice_rows("cut") %>%
        by_slice(~ describe(.x$price), .collate = "rows")
    #> Source: local data frame [5 x 14]
    #>
    #>         cut  vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
    #>      (fctr) (dbl) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
    #> 1      Fair     1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
    #> 2      Good     1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
    #> 3 Very Good     1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
    #> 4   Premium     1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
    #> 5     Ideal     1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233

But it is even simpler with data.table:

    library(data.table)

    as.data.table(diamonds)[, describe(price), by = cut]
    #>          cut vars     n     mean       sd median  trimmed      mad min   max range     skew kurtosis       se
    #> 1:     Ideal    1 21551 3457.542 3808.401 1810.0 2656.136 1630.860 326 18806 18480 1.835587 2.977425 25.94233
    #> 2:   Premium    1 13791 4584.258 4349.205 3185.0 3822.231 3371.432 326 18823 18497 1.333358 1.072295 37.03497
    #> 3:      Good    1  4906 3928.864 3681.590 3050.5 3251.506 2853.264 327 18788 18461 1.721943 3.042550 52.56197
    #> 4: Very Good    1 12082 3981.760 3935.862 2648.0 3243.217 2855.488 336 18818 18482 1.595341 2.235873 35.80721
    #> 5:      Fair    1  1610 4358.758 3560.387 3282.0 3695.648 2183.128 337 18574 18237 1.780213 3.067175 88.73281

We can also write our own summary function that returns a list:

    # Each list element becomes one column of the result
    fun <- function(x) {
        list(n = length(x),
             min = min(x),
             median = as.numeric(median(x)),
             mean = mean(x),
             sd = sd(x),
             max = max(x))
    }

    as.data.table(diamonds)[, fun(price), by = cut]
    #>          cut     n min median     mean       sd   max
    #> 1:     Ideal 21551 326 1810.0 3457.542 3808.401 18806
    #> 2:   Premium 13791 326 3185.0 4584.258 4349.205 18823
    #> 3:      Good  4906 327 3050.5 3928.864 3681.590 18788
    #> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
    #> 5:      Fair  1610 337 3282.0 4358.758 3560.387 18574


answered Oct 05 '22 20:10

Artem Klevtsov

In recent versions of the tidyverse (dplyr 1.0.0 and later), this is possible.

First, in the example you provided, the function returns a one-row data frame. If we use such a function in summarize(), it generates a data-frame column, which we can turn into separate columns via unpack().

    library(tidyverse)
    library(psych)

    describe(diamonds$price)
    #>    vars     n   mean      sd median trimmed     mad min   max range skew
    #> X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
    #>    kurtosis    se
    #> X1     2.18 17.18

    diamonds %>%
      group_by(cut) %>%
      summarize(descr = describe(price)) %>%
      unpack(cols = descr)
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 5 x 14
    #>   cut    vars     n  mean    sd median trimmed   mad   min   max range  skew
    #>   <ord> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1 Fair      1  1610 4359. 3560.  3282    3696. 2183.   337 18574 18237  1.78
    #> 2 Good      1  4906 3929. 3682.  3050.   3252. 2853.   327 18788 18461  1.72
    #> 3 Very…     1 12082 3982. 3936.  2648    3243. 2855.   336 18818 18482  1.60
    #> 4 Prem…     1 13791 4584. 4349.  3185    3822. 3371.   326 18823 18497  1.33
    #> 5 Ideal     1 21551 3458. 3808.  1810    2656. 1631.   326 18806 18480  1.84
    #> # … with 2 more variables: kurtosis <dbl>, se <dbl>
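A related shortcut, assuming dplyr 1.0.0 or later: if the data-frame result is left unnamed inside summarize(), its columns are placed in the output directly, so no unpack() step is needed. The sketch below uses the built-in mtcars data and a hypothetical helper, range_stats(), rather than psych::describe(), purely to keep it self-contained:

```r
library(dplyr)

# Hypothetical helper: several summary statistics as a one-row data frame
range_stats <- function(x) {
  data.frame(min = min(x), median = median(x), max = max(x))
}

res <- mtcars %>%
  group_by(cyl) %>%
  # No name on the left-hand side, so the columns of the returned
  # data frame become columns of the result
  summarise(range_stats(mpg))

res
```

Naming the result (as in descr = describe(price) above) keeps the statistics bundled in a single data-frame column instead, which is useful when several such summaries would otherwise produce clashing column names.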

Second, in some cases a function simply returns a vector as output. In those cases, summarize() generates one new row per value generated.

    set.seed(1234)
    dsmall <- diamonds[sample(nrow(diamonds), 25), ]

    unique(dsmall$clarity)
    #> [1] I1   SI2  VVS2 VS1  VVS1 VS2  SI1  IF
    #> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

    dsmall %>%
      group_by(cut) %>%
      summarize(clarity = unique(clarity))
    #> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
    #> # A tibble: 17 x 2
    #> # Groups:   cut [4]
    #>    cut       clarity
    #>    <ord>     <ord>
    #>  1 Good      I1
    #>  2 Good      SI2
    #>  3 Good      VS1
    #>  4 Good      SI1
    #>  5 Very Good VVS2
    #>  6 Very Good SI2
    #>  7 Very Good VS1
    #>  8 Very Good IF
    #>  9 Premium   SI2
    #> 10 Premium   SI1
    #> 11 Ideal     VS1
    #> 12 Ideal     VVS1
    #> 13 Ideal     VS2
    #> 14 Ideal     VVS2
    #> 15 Ideal     SI1
    #> 16 Ideal     SI2
    #> 17 Ideal     IF
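One caveat for readers on newer versions: as of dplyr 1.1.0, returning more than one row per group from summarize() is deprecated, and reframe() is the function intended for this case. A minimal sketch, assuming dplyr >= 1.1.0 and using the built-in mtcars data instead of the diamonds sample:

```r
library(dplyr)

res <- mtcars %>%
  group_by(cyl) %>%
  # One output row per distinct gear value within each cyl group
  reframe(gear = sort(unique(gear)))

res
```

Unlike summarize(), reframe() always returns an ungrouped data frame, so there is no `.groups` argument to think about.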

Created on 2020-07-14 by the reprex package (v0.3.0)

answered Oct 05 '22 19:10

Claus Wilke
