Extra statistics with summarize_at in dplyr

Tags: r, dplyr

Is there a way to add extra statistics to a summarize_at call? For example

iris %>% group_by(Species) %>% summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd))

will compute the means and standard deviations of 4 columns (giving a total of 8 columns). Suppose I also wanted to know how many rows were in each group. I.e., something like

# Below is not valid syntax 
iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd)) + summarise(n())

Given that the above does not work, a kludge is

iris %>% group_by(Species) %>% summarise_at(vars(Sepal.Length:Petal.Width), funs(mean, sd, length))

which produces, in effect, 4 copies of the count column.

Perhaps this is beyond what can be conveniently handled by summarize_at and friends?

banbh asked Apr 24 '17 18:04



3 Answers

How about this:

iris %>% 
    group_by(Species) %>% 
    mutate(Count = n()) %>%
    group_by(Species, Count) %>%
    summarize_at(vars(Sepal.Length:Petal.Width), funs(mean, sd))
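
For readers on a newer dplyr, here is a minimal sketch of the same idea, assuming dplyr 1.0.0 or later (where across() is available). The per-column statistics and the single row count sit in one summarise() call, so no regrouping is needed. The name iris_stats is only illustrative.

```r
library(dplyr)

# Assumption: dplyr >= 1.0.0, which provides across().
iris_stats <- iris %>%
  group_by(Species) %>%
  summarise(
    # mean and sd for each of the four measurement columns
    across(Sepal.Length:Petal.Width, list(mean = mean, sd = sd)),
    # one row count per group, computed once
    n = n()
  )

iris_stats
```

Placing n = n() after across() keeps the count out of the column selection, which matters if the range is swapped for a broader selector such as where(is.numeric).
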
Tim Goodman answered Oct 17 '22 16:10



We can do this with data.table in a more flexible way:

library(data.table)
as.data.table(iris)[, c(n = .N, unlist(lapply(.SD, function(x) 
    list(Mean=mean(x), SD=sd(x))), recursive = FALSE)), .(Species)]
# Species  n Sepal.Length.Mean Sepal.Length.SD Sepal.Width.Mean Sepal.Width.SD Petal.Length.Mean Petal.Length.SD Petal.Width.Mean
#1:     setosa 50             5.006       0.3524897            3.428      0.3790644             1.462       0.1736640            0.246
#2: versicolor 50             5.936       0.5161711            2.770      0.3137983             4.260       0.4699110            1.326
#3:  virginica 50             6.588       0.6358796            2.974      0.3224966             5.552       0.5518947            2.026
#   Petal.Width.SD
#1:      0.1053856
#2:      0.1977527
#3:      0.2746501

Or, using dplyr, we can compute the count separately and join it to the summary:

iris1 <- iris %>%
             group_by(Species) %>% 
             summarise_all(funs(mean, sd))

iris %>% 
     group_by(Species) %>% 
     summarise(n = n()) %>%
     full_join(iris1, by = "Species")

Or with bind_cols:

iris %>%
 group_by(Species) %>% 
 summarise_all(funs(mean, sd)) %>%
 bind_cols(., iris %>% count(Species) %>% select(-Species))
# A tibble: 3 × 10
#     Species Sepal.Length_mean Sepal.Width_mean Petal.Length_mean Petal.Width_mean Sepal.Length_sd Sepal.Width_sd Petal.Length_sd Petal.Width_sd     n
#      <fctr>             <dbl>            <dbl>             <dbl>            <dbl>           <dbl>          <dbl>           <dbl>          <dbl> <int>
#1     setosa             5.006            3.428             1.462            0.246       0.3524897      0.3790644       0.1736640      0.1053856    50
#2 versicolor             5.936            2.770             4.260            1.326       0.5161711      0.3137983       0.4699110      0.1977527    50
#3  virginica             6.588            2.974             5.552            2.026       0.6358796      0.3224966       0.5518947      0.2746501    50
akrun answered Oct 17 '22 14:10



To specify which column the statistics are applied to:

iris %>%   group_by(Species) %>% 
     mutate(Count = n()) %>%
     group_by(Species, Count) %>%
     summarize_at(vars(Sepal.Length), funs(mean, sd)) -> dt_stat
dt_stat

Or, to apply them to all columns starting with "Sepal":

iris %>%   group_by(Species) %>% 
     mutate(Count = n()) %>%
     group_by(Species, Count) %>%
     summarize_at(vars(starts_with("Sepal")), funs(mean, sd)) -> dt_stat2
dt_stat2
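
Note that funs() has been deprecated since dplyr 0.8.0, so the calls above may warn on newer installations. A hedged sketch of the same "Sepal" summary with a named list in place of funs(), which should be an adequate substitute in this case; sepal_stats is an illustrative name:

```r
library(dplyr)

sepal_stats <- iris %>%
  group_by(Species) %>%
  # Count is constant within each species, so it can join the grouping
  mutate(Count = n()) %>%
  group_by(Species, Count) %>%
  # a named list of functions replaces the deprecated funs(mean, sd)
  summarise_at(vars(starts_with("Sepal")), list(mean = mean, sd = sd))

sepal_stats
```
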
KK_63 answered Oct 17 '22 16:10
