Suppose we want to group_by()
and summarise
a massive data.frame with very many columns, but that there are some large groups of consecutive columns that will have the same summarise
condition (e.g. max
, mean
etc)
Is there a way to avoid having to specify the summarise
condition for each and every column, and instead do it for ranges of columns?
Suppose we want to do this:
iris %>%
group_by(Species) %>%
summarise(max(Sepal.Length), mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width))
but note that 3 consecutive columns have the same summarise
condition, mean(Sepal.Width), mean(Petal.Length), mean(Petal.Width)
Is there a way to use some method like mean(Sepal.Width:Petal.Width)
to specify the condition for the range of columns, and hence a avoiding having to type out the summarise condition multiple times for all the columns in between)
The iris example above is a small and manageable example that has a range of 3 consecutive columns, but actual use case has ~hundreds.
You can override using the `.groups` argument.” is that the dplyr package drops the last group variable that was specified in the group_by function, in case we are using multiple columns to group our data before applying the summarise function. This message helps to make the user aware that a grouping was performed.
The reason for the message “`summarise ()` has grouped output by ‘X’. You can override using the `.groups` argument.” is that the dplyr package drops the last group variable that was specified in the group_by function, in case we are using multiple columns to group our data before applying the summarise function.
If a variable in .vars is named, a new column by that name will be created. Name collisions in the new columns are disambiguated using a unique suffix. The functions are maturing, because the naming scheme and the disambiguation algorithm are subject to change in dplyr 0.9.0.
Name collisions in the new columns are disambiguated using a unique suffix. The functions are maturing, because the naming scheme and the disambiguation algorithm are subject to change in dplyr 0.9.0. # The _if () variants apply a predicate function (a function that # returns TRUE or FALSE) to determine the relevant subset of # columns.
The upcoming version 1.0.0 of dplyr
will have across()
function that does what you wish for
Basic usage
across()
has two primary arguments:
- The first argument,
.cols
, selects the columns you want to operate on. It uses tidy selection (likeselect()
) so you can pick variables by position, name, and type.
- The second argument,
.fns
, is a function or list of functions to apply to each column. This can also be a purrr style formula (or list of formulas) like~ .x / 2
. (This argument is optional, and you can omit it if you just want to get the underlying data; you'll see that technique used invignette("rowwise")
.)
### Install development version on GitHub first
# install.packages("devtools")
# devtools::install_github("tidyverse/dplyr")
library(dplyr, warn.conflicts = FALSE)
Control how the names are created with the .names
argument which takes a glue spec:
iris %>%
group_by(Species) %>%
summarise(
across(c(Sepal.Width:Petal.Width), ~ mean(.x, na.rm = TRUE), .names = "mean_{col}"),
across(c(Sepal.Length), ~ max(.x, na.rm = TRUE), .names = "max_{col}")
)
#> # A tibble: 3 x 5
#> Species mean_Sepal.Width mean_Petal.Leng~ mean_Petal.Width max_Sepal.Length
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 3.43 1.46 0.246 5.8
#> 2 versicolor 2.77 4.26 1.33 7
#> 3 virginica 2.97 5.55 2.03 7.9
Using multiple functions
my_func <- list(
mean = ~ mean(., na.rm = TRUE),
max = ~ max(., na.rm = TRUE)
)
iris %>%
group_by(Species) %>%
summarise(across(where(is.numeric), my_func, .names = "{fn}.{col}"))
#> # A tibble: 3 x 9
#> Species mean.Sepal.Length max.Sepal.Length mean.Sepal.Width max.Sepal.Width
#> * <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 setosa 5.01 5.8 3.43 4.4
#> 2 versicolor 5.94 7 2.77 3.4
#> 3 virginica 6.59 7.9 2.97 3.8
#> mean.Petal.Length max.Petal.Length mean.Petal.Width max.Petal.Width
#> * <dbl> <dbl> <dbl> <dbl>
#> 1 1.46 1.9 0.246 0.6
#> 2 4.26 5.1 1.33 1.8
#> 3 5.55 6.9 2.03 2.5
Created on 2020-03-06 by the reprex package (v0.3.0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With