Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

tidyverse summarize multiple columns but show result as rows

I have data where I want to get a bunch of summary statistics for multiple columns with the tidyverse approach. However, utilizing tidyverse's summarize function, it will create each column statistic as a new column, whereas I would prefer to see the column names as rows and each statistic as a new column. So my question is:

Is there a more elegant (and I know "elegant" is a vague term) way to achieve this than by accompanying the summarize function with a pivot_longer and pivot_wider?

I'm using the latest dev versions of the tidyverse package, i.e. dplyr 0.8.99.9003 and tidyr 1.1.0. So it's fine if any solution requires new functions from these packages that are not yet on CRAN.

library(tidyverse)

dat <- as.data.frame(matrix(1:100, ncol = 5))

dat %>%
  summarize(across(everything(), list(mean = mean,
                                      sum  = sum))) %>%
  pivot_longer(cols      = everything(),
               names_sep = "_",
               names_to  = c("variable", "statistic")) %>%
  pivot_wider(names_from = "statistic")

Expected outcome:

# A tibble: 5 x 3
  variable  mean   sum
  <chr>    <dbl> <dbl>
1 V1        10.5   210
2 V2        30.5   610
3 V3        50.5  1010
4 V4        70.5  1410
5 V5        90.5  1810

Note: I'm not set on the name of any of the columns, so if there's a nice way to get the structure of the table with different/generic names, that'd also be fine.

like image 657
deschen Avatar asked May 27 '20 11:05

deschen


People also ask

What does rowwise () do in R?

rowwise() allows you to compute on a data frame a row-at-a-time. This is most useful when a vectorised function doesn't exist. Most dplyr verbs preserve row-wise grouping. The exception is summarise() , which return a grouped_df.

How do I sum across rows in R dplyr?

Syntax: mutate(new-col-name = rowSums(.)) The rowSums() method is used to calculate the sum of each row and then append the value at the end of each row under the new column name specified. The argument . is used to apply the function over all the cells of the data frame. Syntax: rowSums(.)

Can you group by multiple columns in dplyr?

By using group_by() function from dplyr package we can perform group by on multiple columns or variables (two or more columns) and summarise on multiple columns for aggregations.


2 Answers

You can skip the pivot_wider step by using ".value" in names_to.

library(dplyr)

dat %>%
  summarise_all(list(mean = mean,sum  = sum)) %>%
  tidyr::pivot_longer(cols = everything(),
               names_sep = "_",
               names_to  = c("variable", ".value"))


# A tibble: 5 x 3
#  variable  mean   sum
#  <chr>    <dbl> <int>
#1 V1        10.5   210
#2 V2        30.5   610
#3 V3        50.5  1010
#4 V4        70.5  1410
#5 V5        90.5  1810
like image 198
Ronak Shah Avatar answered Sep 28 '22 08:09

Ronak Shah


not a tidyverse solution, but a data.table one instead.. also, not sure if it is more 'elegant' ;-)

but here you go...

library( data.table )
#make 'dat' a data.table
setDT(dat)
#transpose, keeping column names
dat <- transpose(dat, keep.names = "var_name" )
#melt to long and summarise
melt(dat, id.vars = "var_name")[, .(mean = mean(value), sum = sum(value) ), by = var_name]


#    var_name mean  sum
# 1:       V1 10.5  210
# 2:       V2 30.5  610
# 3:       V3 50.5 1010
# 4:       V4 70.5 1410
# 5:       V5 90.5 1810
like image 43
Wimpel Avatar answered Sep 28 '22 08:09

Wimpel