Apply a summarise condition to a range of columns when using dplyr group_by?

Tags:

Suppose we want to group_by() and summarise a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise condition (e.g. max, mean etc)

Is there a way to avoid having to specify the summarise condition for each and every column, and instead do it for ranges of columns?

Example

Suppose we want to do this:

iris %>% 
  group_by(Species) %>% 
  summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))

but note that 3 consecutive columns have the same summarise condition, mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width)

Is there a way to use some method like mean(Sepal.Width:Petal.Width) to specify the condition for the range of columns, and hence a avoiding having to type out the summarise condition multiple times for all the columns in between)

Note

The iris example above is a small and manageable example that has a range of 3 consecutive columns, but actual use case has ~hundreds.

435

asked Mar 07 '20 04:03

stevec

Video Answer

1 Answers

The upcoming version 1.0.0 of dplyr will have across() function that does what you wish for

Basic usage

across() has two primary arguments:

The first argument, .cols, selects the columns you want to operate on. It uses tidy selection (like select()) so you can pick variables by position, name, and type.

The second argument, .fns, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like ~ .x / 2. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used in vignette("rowwise").)

### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)

Control how the names are created with the .names argument which takes a glue spec:

iris %>% 
  group_by(Species) %>% 
  summarise(
    across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
    across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
    )
#> # A tibble: 3 x 5
#>   Species    mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct>                 <dbl>            <dbl>            <dbl>            <dbl>
#> 1 setosa                 3.43             1.46            0.246              5.8
#> 2 versicolor             2.77             4.26            1.33               7  
#> 3 virginica              2.97             5.55            2.03               7.9

Using multiple functions

my_func <- list(
  mean = ~ mean(., na.rm = TRUE),
  max  = ~ max(., na.rm = TRUE)
)

iris %>%
  group_by(Species) %>%
  summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#>   Species    mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct>                  <dbl>            <dbl>            <dbl>           <dbl>
#> 1 setosa                  5.01              5.8             3.43             4.4
#> 2 versicolor              5.94              7               2.77             3.4
#> 3 virginica               6.59              7.9             2.97             3.8
#>   mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> *             <dbl>            <dbl>            <dbl>           <dbl>
#> 1              1.46              1.9            0.246             0.6
#> 2              4.26              5.1            1.33              1.8
#> 3              5.55              6.9            2.03              2.5

^{Created on 2020-03-06 by the reprex package (v0.3.0)}

136

answered Sep 28 '22 00:09

Tung

Related questions
                            
                                R dplyr: change the row value of columns having an specific name
                            
                                R filterings rows that contain a combination of words
                            
                                r sf package centroid within polygon
                            
                                How to show formatted R output with results='asis' in rmarkdown
                            
                                How to expand ggplot y axis limits to include maximum value
                            
                                What is the opposite function of max.col (R language)
                            
                                Efficient way to fill column with numbers that identify observations with same value in column [duplicate]
                            
                                How to bind rows without losing those with character(0)?
                            
                                Apply color brewer to a single line in ggplot
                            
                                Override horizontal positioning with ggrepel
                            
                                grepl on two vectors element by element
                            
                                R remove duplicate rows keeping those with values
                            
                                How do I use facetting correctly in ggplot geom_tile, while keeping the aspect ratio intact?
                            
                                Chop off the first letter of every variable name [duplicate]
                            
                                Dplyr filter top and bottom rows by value simultaneously on grouped data
                            
                                Compare the words from a data frame and calculate a matrix with the length of the biggest word for each pair
                            
                                How to make a frequency table by class [duplicate]
                            
                                One hot encode list of vectors
                            
                                Cumulative sum in R by group and start over when sum of values in group larger than maximum value
                            
                                Fail to render an animation

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Apply a summarise condition to a range of columns when using dplyr group_by?

Tags:

r

group-by

dplyr