Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr - summary table for multiple variables

Tags:

r

dplyr

How to create simple summary statistics using dplyr from multiple variables? Using the summarise_each function seems to be the way to go, however, when applying multiple functions to multiple columns, the result is a wide, hard-to-read data frame.

like image 312
paljenczy Avatar asked Jan 04 '16 15:01

paljenczy


People also ask

How do I Summary multiple variables in R?

Summarise multiple variables The functions summarise_all() , summarise_at() and summarise_if() can be used to summarise multiple columns at once.

How do I get summary statistics for variables in R?

R provides a wide range of functions for obtaining summary statistics. One method of obtaining descriptive statistics is to use the sapply( ) function with a specified summary statistic. Possible functions used in sapply include mean, sd, var, min, max, median, range, and quantile.

How do I create a summary statistics table in R?

The easiest way to create summary tables in R is to use the describe() and describeBy() functions from the psych library.

How do you group by and summarize in R?

Group By Summarise R ExampleTo get the dropped dataframe use group_by() function. To use group_by() and summarize() functions, you have to install dplyr first using install. packages('dplyr') and load it using library(dplyr) . All functions in dplyr package take data.


1 Answers

Use dplyr in combination with tidyr to reshape the end result.

library(dplyr)
library(tidyr)

df <- tbl_df(mtcars)

df.sum <- df %>%
  select(mpg, cyl, vs, am, gear, carb) %>% # select variables to summarise
  summarise_each(funs(min = min, 
                      q25 = quantile(., 0.25), 
                      median = median, 
                      q75 = quantile(., 0.75), 
                      max = max,
                      mean = mean, 
                      sd = sd))

# the result is a wide data frame
> dim(df.sum)
[1]  1 42

# reshape it using tidyr functions

df.stats.tidy <- df.sum %>% gather(stat, val) %>%
  separate(stat, into = c("var", "stat"), sep = "_") %>%
  spread(stat, val) %>%
  select(var, min, q25, median, q75, max, mean, sd) # reorder columns

> print(df.stats.tidy)

   var  min    q25 median  q75  max     mean        sd
1   am  0.0  0.000    0.0  1.0  1.0  0.40625 0.4989909
2 carb  1.0  2.000    2.0  4.0  8.0  2.81250 1.6152000
3  cyl  4.0  4.000    6.0  8.0  8.0  6.18750 1.7859216
4 gear  3.0  3.000    4.0  4.0  5.0  3.68750 0.7378041
5  mpg 10.4 15.425   19.2 22.8 33.9 20.09062 6.0269481
6   vs  0.0  0.000    0.0  1.0  1.0  0.43750 0.5040161
like image 146
paljenczy Avatar answered Oct 07 '22 01:10

paljenczy