
How to interpret dplyr message `summarise()` regrouping output by 'x' (override with `.groups` argument)?

Tags:

r

dplyr

summarize



It is just a friendly informational message, not an error. By default, if the data is grouped before summarise(), the result drops one grouping variable: the last one specified in group_by(). With a single grouping variable, the output of summarise() has no grouping at all. With more than one (two in the question), the grouping is reduced by one level, i.e. the result in the question is still grouped by 'year'. As a reproducible example:

library(dplyr)
mtcars %>%
     group_by(am) %>% 
     summarise(mpg = sum(mpg))
#`summarise()` ungrouping output (override with `.groups` argument)
# A tibble: 2 x 2
#     am   mpg
#* <dbl> <dbl>
#1     0  326.
#2     1  317.

The message says the output is being ungrouped: with a single variable in group_by(), that grouping is dropped after summarise().

mtcars %>% 
   group_by(am, vs) %>% 
   summarise(mpg = sum(mpg))
#`summarise()` regrouping output by 'am' (override with `.groups` argument)
# A tibble: 4 x 3
# Groups:   am [2]
#     am    vs   mpg
#  <dbl> <dbl> <dbl>
#1     0     0  181.
#2     0     1  145.
#3     1     0  118.
#4     1     1  199.

Here, it drops the last grouping variable (vs) and keeps the output grouped by 'am'.
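To check which grouping is left without reading the printed header, the result can be inspected with group_vars(). This is a minimal sketch on the same mtcars data; the object names res_two and res_one are only illustrative.

```r
library(dplyr)

# Two grouping variables: the default drops the last one (vs)
res_two <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))

# One grouping variable: nothing is left to group by
res_one <- mtcars %>%
  group_by(am) %>%
  summarise(mpg = sum(mpg))

group_vars(res_two)  # "am"
group_vars(res_one)  # character(0)
```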

If we check ?summarise, there is a .groups argument, which by default behaves as "drop_last" in this case; the other options are "drop", "keep" and "rowwise":

.groups - Grouping structure of the result.

"drop_last": dropping the last level of grouping. This was the only supported option before version 1.0.0.

"drop": All levels of grouping are dropped.

"keep": Same grouping structure as .data.

"rowwise": Each row is its own group.

When .groups is not specified, you either get "drop_last" when all the results are size 1, or "keep" if the size varies. In addition, a message informs you of that choice, unless the option "dplyr.summarise.inform" is set to FALSE.
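The four options quoted above can be compared side by side. The following is a sketch on mtcars, with illustrative object names, that only looks at the grouping left on each result; the summarised values themselves are the same in all four cases.

```r
library(dplyr)

grouped <- mtcars %>% group_by(am, vs)

drop_last <- grouped %>% summarise(mpg = sum(mpg), .groups = "drop_last")
dropped   <- grouped %>% summarise(mpg = sum(mpg), .groups = "drop")
kept      <- grouped %>% summarise(mpg = sum(mpg), .groups = "keep")
by_row    <- grouped %>% summarise(mpg = sum(mpg), .groups = "rowwise")

group_vars(drop_last)            # "am"
group_vars(dropped)              # character(0)
group_vars(kept)                 # "am" "vs"
inherits(by_row, "rowwise_df")   # TRUE

# The computed statistic does not depend on .groups
identical(drop_last$mpg, dropped$mpg)
```

Because every call passes .groups explicitly, none of them prints the message.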

That is, if we set .groups explicitly in summarise(), we don't get the message. With .groups = 'drop', the result also carries no grouping at all:

mtcars %>% 
    group_by(am) %>%
    summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 2 x 2
#     am   mpg
#* <dbl> <dbl>
#1     0  326.
#2     1  317.


mtcars %>%
   group_by(am, vs) %>%
   summarise(mpg = sum(mpg), .groups = 'drop')
# A tibble: 4 x 3
#     am    vs   mpg
#* <dbl> <dbl> <dbl>
#1     0     0  181.
#2     0     1  145.
#3     1     0  118.
#4     1     1  199.


mtcars %>% 
   group_by(am, vs) %>% 
   summarise(mpg = sum(mpg), .groups = 'drop') %>%
   str
#tibble [4 × 3] (S3: tbl_df/tbl/data.frame)
# $ am : num [1:4] 0 0 1 1
# $ vs : num [1:4] 0 1 0 1
# $ mpg: num [1:4] 181 145 118 199
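Besides passing .groups in each call, the documentation quoted above mentions the option "dplyr.summarise.inform". A minimal sketch of silencing the message for a whole session, assuming the default grouping behaviour is what you want, could look like this (old_opts and res are illustrative names):

```r
library(dplyr)

# Turn the informational message off, remembering the previous setting
old_opts <- options(dplyr.summarise.inform = FALSE)

res <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg))

# The default grouping rule still applies; only the message is gone
group_vars(res)  # "am"

# Restore the previous setting
options(old_opts)
```

Note that this hides the message but does not change the result, so the output is still grouped by am.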

Previously, this message was not issued, which could lead to situations where the user does a mutate() or something else assuming there is no grouping, and gets unexpected output. Now, the message tells the user to be careful because a grouping attribute may still be present.
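A small sketch of the kind of surprise meant here, using mtcars and an illustrative share column: the same mutate() gives different answers depending on whether the leftover grouping by am is still attached.

```r
library(dplyr)

totals <- mtcars %>%
  group_by(am, vs) %>%
  summarise(mpg = sum(mpg), .groups = "drop_last")  # still grouped by am

# sum(mpg) is evaluated within each am group
within_am <- totals %>%
  mutate(share = mpg / sum(mpg))

# sum(mpg) is evaluated over all four rows
overall <- totals %>%
  ungroup() %>%
  mutate(share = mpg / sum(mpg))

sum(within_am$share)  # shares add up to 1 per am group, so 2 in total
sum(overall$share)    # shares add up to 1 overall
```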

NOTE: The .groups argument is currently marked as experimental in its lifecycle, so its behaviour could change in future releases.

Depending on whether we need further transformations of the data based on the same grouping variables, we can choose among the different options in .groups.


The answer is explained in ?summarise: "When .groups is not specified, it is chosen based on the number of rows of the results: If all the results have 1 row, you get "drop_last". If the number of rows varies, you get "keep"."

Basically, you get this message when there is more than one option that could be used for the .groups argument. The message tells you which option was applied according to the rule above: "drop_last" when every result has 1 row, "keep" when the number of rows varies. Say that your pipeline applies two or more grouping criteria, but afterwards you need to work on the summarised data across all values regardless of grouping; this can be done by setting .groups = 'drop'. Note that, as you can see in @akrun's example, the statistic values remain the same no matter which option is passed to .groups (I applied the different options to one of my datasets and obtained the same values), because the argument does not change how the summary is computed: 'grouping structure is controlled by the .groups= argument...'. However, by specifying the .groups argument, no message is printed.

The bottom line is that the rule is about how many rows summarise() returns for each group. If no grouping criteria are used, the statistic is calculated across all rows and there is no grouping to drop or keep. If grouping is used and each expression returns a single value per group (as sum() or mean() do), all 'results have 1 row' and you get "drop_last". Only when an expression returns a different number of values for different groups does 'the number of rows varies' apply, and then you get "keep".


Paraphrasing the accepted answer, it is just a friendly confusing warning.

summarise() has grouped output by 'xxx'

should be read as: the output is fine and still contains all grouping columns; only the set of grouping keys may be reduced.

Example: grouping mtcars by cyl and am, and calculating mean(mpg):

mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg))
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 6 x 3
# Groups:   cyl [3]
    cyl    am avg_mpg
  <dbl> <dbl>   <dbl>
1     4     0    22.9
2     4     1    28.1
3     6     0    19.1
4     6     1    20.6
5     8     0    15.0
6     8     1    15.4

The message says that only the first of the original grouping keys is preserved in the output, because the default .groups = "drop_last" was used. See the line # Groups: cyl [3].

Nevertheless, the columns are complete: both cyl and am are present.

Here is a quick overview of the available options, showing the result with the function group_keys():

mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg)) %>% group_keys() 
`summarise()` has grouped output by 'cyl'. You can override using the `.groups` argument.
# A tibble: 3 x 1
    cyl
  <dbl>
1     4
2     6
3     8

mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "keep") %>% group_keys() 
# A tibble: 6 x 2
    cyl    am
  <dbl> <dbl>
1     4     0
2     4     1
3     6     0
4     6     1
5     8     0
6     8     1

mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% group_keys() 
# A tibble: 1 x 0

The only visible consequence appears in cascading summarisation: the example below produces only one summary row because the group keys were dropped.

mtcars %>% group_by(cyl, am) %>% summarise(avg_mpg = mean(mpg), .groups = "drop") %>% summarise(min_avg_mpg = min(avg_mpg))
# A tibble: 1 x 1
  min_avg_mpg
        <dbl>
1   15.0

But as the grouping columns are all still available, it should not be a problem to reset the group keys as required, using group_by(cyl, am) before the subsequent summarisation.
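As a sketch of that last point, the dropped grouping can be re-established before the second summarise(). Regrouping by cyl alone is an illustrative choice made here so that the second step has something to aggregate; avg and per_cyl are hypothetical names.

```r
library(dplyr)

# First step: one row per (cyl, am) pair, with all grouping dropped
avg <- mtcars %>%
  group_by(cyl, am) %>%
  summarise(avg_mpg = mean(mpg), .groups = "drop")

# Second step: regroup explicitly, then take the minimum within each cyl
per_cyl <- avg %>%
  group_by(cyl) %>%
  summarise(min_avg_mpg = min(avg_mpg), .groups = "drop")

per_cyl  # one row for each of the three cyl values
```

Stating both the grouping and .groups explicitly at each step keeps the pipeline independent of the default and of the message.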