I am wondering if there is a way to use functions with summarise (dplyr 0.1.2) that return multiple values (for instance, the describe function from the psych package).
If not, is it just because it hasn't been implemented yet, or is there a reason that it wouldn't be a good idea?
Example:
    require(psych)
    require(ggplot2)
    require(dplyr)

    dgrp <- group_by(diamonds, cut)
    describe(dgrp$price)
    summarise(dgrp, describe(price))
produces: Error: expecting a single value
With dplyr >= 0.2 we can use the do function for this:
    library(ggplot2)
    library(psych)
    library(dplyr)

    diamonds %>%
      group_by(cut) %>%
      do(describe(.$price)) %>%
      select(-vars)
    #> Source: local data frame [5 x 13]
    #> Groups: cut [5]
    #>
    #>         cut     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
    #>      (fctr) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
    #> 1      Fair  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
    #> 2      Good  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
    #> 3 Very Good 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
    #> 4   Premium 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
    #> 5     Ideal 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
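As a side note, do() is marked as superseded in dplyr 1.0.0 and later, and group_modify() covers the same use case as long as the function returns a data frame. A minimal sketch, assuming dplyr >= 0.8.1 and the built-in mtcars data; stats_row is a hypothetical stand-in for describe(), not a function from any package:

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration): returns a one-row
# data frame of summary statistics, standing in for psych::describe().
stats_row <- function(x) {
  data.frame(n = length(x), mean = mean(x), sd = sd(x))
}

# group_modify() calls the function once per group; .x is that group's rows
# (without the grouping column) and the result must be a data frame.
by_cyl <- mtcars %>%
  group_by(cyl) %>%
  group_modify(~ stats_row(.x$mpg))

by_cyl
```

The grouping column is added back in front of the returned columns, so the result has one row per group with cyl, n, mean and sd.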
A solution based on the purrr package (purrrlyr since 2017):
    library(ggplot2)
    library(psych)
    library(purrr)  # slice_rows() and by_slice() live in purrrlyr since 2017

    diamonds %>%
      slice_rows("cut") %>%
      by_slice(~ describe(.x$price), .collate = "rows")
    #> Source: local data frame [5 x 14]
    #>
    #>         cut  vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
    #>      (fctr) (dbl) (dbl)    (dbl)    (dbl)  (dbl)    (dbl)    (dbl) (dbl) (dbl) (dbl)    (dbl)    (dbl)    (dbl)
    #> 1      Fair     1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
    #> 2      Good     1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
    #> 3 Very Good     1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
    #> 4   Premium     1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
    #> 5     Ideal     1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
But it is so simple with data.table:
    library(data.table)

    as.data.table(diamonds)[, describe(price), by = cut]
    #>          cut vars     n     mean       sd median  trimmed      mad   min   max range     skew kurtosis       se
    #> 1:     Ideal    1 21551 3457.542 3808.401 1810.0 2656.136 1630.860   326 18806 18480 1.835587 2.977425 25.94233
    #> 2:   Premium    1 13791 4584.258 4349.205 3185.0 3822.231 3371.432   326 18823 18497 1.333358 1.072295 37.03497
    #> 3:      Good    1  4906 3928.864 3681.590 3050.5 3251.506 2853.264   327 18788 18461 1.721943 3.042550 52.56197
    #> 4: Very Good    1 12082 3981.760 3935.862 2648.0 3243.217 2855.488   336 18818 18482 1.595341 2.235873 35.80721
    #> 5:      Fair    1  1610 4358.758 3560.387 3282.0 3695.648 2183.128   337 18574 18237 1.780213 3.067175 88.73281
We can also write our own summary function that returns a list:
    fun <- function(x) {
      list(n = length(x),
           min = min(x),
           median = as.numeric(median(x)),
           mean = mean(x),
           sd = sd(x),
           max = max(x))
    }

    as.data.table(diamonds)[, fun(price), by = cut]
    #>          cut     n min median     mean       sd   max
    #> 1:     Ideal 21551 326 1810.0 3457.542 3808.401 18806
    #> 2:   Premium 13791 326 3185.0 4584.258 4349.205 18823
    #> 3:      Good  4906 327 3050.5 3928.864 3681.590 18788
    #> 4: Very Good 12082 336 2648.0 3981.760 3935.862 18818
    #> 5:      Fair  1610 337 3282.0 4358.758 3560.387 18574
In recent versions of the tidyverse, this is possible.
First, in the example you provided, the function returns a one-row data frame. If we use such a function in summarize(), it generates a data-frame column, which we can turn into separate columns via unpack().
    library(tidyverse)
    library(psych)

    describe(diamonds$price)
    #>    vars     n   mean      sd median trimmed     mad min   max range skew
    #> X1    1 53940 3932.8 3989.44   2401 3158.99 2475.94 326 18823 18497 1.62
    #>    kurtosis    se
    #> X1     2.18 17.18

    diamonds %>%
      group_by(cut) %>%
      summarize(descr = describe(price)) %>%
      unpack(cols = descr)
    #> `summarise()` ungrouping output (override with `.groups` argument)
    #> # A tibble: 5 x 14
    #>   cut    vars     n  mean    sd median trimmed   mad   min   max range  skew
    #>   <ord> <dbl> <dbl> <dbl> <dbl>  <dbl>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
    #> 1 Fair      1  1610 4359. 3560.  3282    3696. 2183.   337 18574 18237  1.78
    #> 2 Good      1  4906 3929. 3682.  3050.   3252. 2853.   327 18788 18461  1.72
    #> 3 Very…     1 12082 3982. 3936.  2648    3243. 2855.   336 18818 18482  1.60
    #> 4 Prem…     1 13791 4584. 4349.  3185    3822. 3371.   326 18823 18497  1.33
    #> 5 Ideal     1 21551 3458. 3808.  1810    2656. 1631.   326 18806 18480  1.84
    #> # … with 2 more variables: kurtosis <dbl>, se <dbl>
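The same mechanism should work with any function that returns a one-row data frame. A minimal sketch, assuming dplyr >= 1.0.0 and the built-in mtcars data; range_stats is a hypothetical helper written for illustration. When the data-frame result is left unnamed inside summarise(), dplyr spreads its columns into the output directly, so no unpack() step is needed:

```r
library(dplyr)

# Hypothetical helper (an assumption for illustration): one row, several columns.
range_stats <- function(x) {
  data.frame(n = length(x), min = min(x), mean = mean(x), max = max(x))
}

# The unnamed data-frame result is expanded into the columns n, min, mean, max.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(range_stats(mpg))

res
```

Naming the result (stats = range_stats(mpg)) would instead give a single data-frame column called stats, which is the case where unpack() is useful.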
Second, in some cases a function simply returns a vector as output. In those cases, summarize() generates one new row per value generated.
    set.seed(1234)
    dsmall <- diamonds[sample(nrow(diamonds), 25), ]

    unique(dsmall$clarity)
    #> [1] I1   SI2  VVS2 VS1  VVS1 VS2  SI1  IF
    #> Levels: I1 < SI2 < SI1 < VS2 < VS1 < VVS2 < VVS1 < IF

    dsmall %>%
      group_by(cut) %>%
      summarize(clarity = unique(clarity))
    #> `summarise()` regrouping output by 'cut' (override with `.groups` argument)
    #> # A tibble: 17 x 2
    #> # Groups:   cut [4]
    #>    cut       clarity
    #>    <ord>     <ord>
    #>  1 Good      I1
    #>  2 Good      SI2
    #>  3 Good      VS1
    #>  4 Good      SI1
    #>  5 Very Good VVS2
    #>  6 Very Good SI2
    #>  7 Very Good VS1
    #>  8 Very Good IF
    #>  9 Premium   SI2
    #> 10 Premium   SI1
    #> 11 Ideal     VS1
    #> 12 Ideal     VVS1
    #> 13 Ideal     VS2
    #> 14 Ideal     VVS2
    #> 15 Ideal     SI1
    #> 16 Ideal     SI2
    #> 17 Ideal     IF
Created on 2020-07-14 by the reprex package (v0.3.0)
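For readers on newer releases: from dplyr 1.1.0 onward, returning more than one row per group from summarise() is deprecated, and reframe() is the intended replacement for the vector-valued case. A minimal sketch, assuming dplyr >= 1.1.0 and the built-in mtcars data rather than the diamonds sample used in the reprex:

```r
library(dplyr)

# reframe() allows any number of rows per group and always returns an
# ungrouped data frame.
gears_by_cyl <- mtcars %>%
  group_by(cyl) %>%
  reframe(gear = unique(gear))

gears_by_cyl
```

Each cyl value appears once for every distinct gear found in that group.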