dplyr group by external variable [closed]

Question

I have legacy code that makes extensive use of dplyr to group by programmatically specified variables in ways that are now either deprecated or superseded. Reproducible examples are given below. I would like to update this code to a stable option so that it continues to work with future versions of dplyr. Several alternative methods can be shown to give the same result as the original code in a simple case, but I would like to know whether they are truly equivalent in edge cases. Skipping over the early years that required quo, enquo, sym, !!, !!! etc. to get around the challenges of programming with NSE, the first example is group_by_(), as in:

library(dplyr)
Var1 <- "gear"
Var2 <- "cyl"

test1 <- mtcars %>% 
  group_by_(Var1, Var2) %>%
  summarise(Mean_mpg = mean(mpg))

This worked fine, and still appears to do so, but comes up with a warning that group_by_() was deprecated in dplyr 0.7.0.

The next option used in some legacy code is:

test2 <- mtcars %>% 
  group_by_at(c(Var1, Var2)) %>%
  summarise(Mean_mpg = mean(mpg))

This also runs OK, but the documentation lists this as superseded, and suggests the use of across(). Following that advice:

test3 <- mtcars %>% 
  group_by(across(c(Var1, Var2))) %>%
  summarise(Mean_mpg = mean(mpg))

This works, but gives the warning: "Using an external vector in selections was deprecated in tidyselect 1.1.0. ℹ Please use all_of() or any_of() instead."

Taking this advice (maybe should be worded "as well" not "instead"?):

test4 <- mtcars %>% 
  group_by(across(all_of(c(Var1, Var2)))) %>%
  summarise(Mean_mpg = mean(mpg))

The "Programming with dplyr" vignette introduces yet another way of doing this:

test5 <- mtcars %>% 
  group_by(across(c({{Var1}}, {{Var2}}))) %>%
  summarise(Mean_mpg = mean(mpg))

All five of these give identical results in dplyr version 1.1.4 for this simple case:

sapply(list(test2, test3, test4, test5), identical, test1)

I appreciate that across() etc. have widespread other uses, but just for the purpose of passing a small number of variables to a grouping function, are there specific under-the-hood reasons (performance, error-trapping etc.) that mean that working production code of the form in test1 and test2 should be updated, and if so what is the latest preferred form? In other words, is:

group_by_(Var1, Var2)

the same as:

group_by(across(all_of(c(Var1, Var2))))?

Also, I know this is impossible to answer definitively, but does anyone have an inside track on how long group_by_() and group_by_at() are likely to be around, i.e. at what point will legacy code containing them start to fail?

L Tyrone · Accepted Answer

In your use case, you don't need group_by() at all: you can pass .by = inside summarise() instead. You don't need across() either. So this is an option:

library(dplyr)

Var1 <- "gear"
Var2 <- "cyl"

mtcars |>
  summarise(Mean_mpg = mean(mpg), .by = all_of(c(Var1, Var2)))

#   gear cyl Mean_mpg
# 1    4   6   19.750
# 2    4   4   26.925
# 3    3   6   19.750
# 4    3   8   15.050
# 5    3   4   21.500
# 6    5   4   28.200
# 7    5   8   15.400
# 8    5   6   19.700
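
One difference from the group_by() forms in the question is worth checking before swapping them for .by. As a minimal sketch (the object names by_result and grouped_result are illustrative, not from the question), .by groups only for the duration of the call and returns an ungrouped result, whereas group_by() followed by summarise() drops only the last grouping variable by default:

```r
library(dplyr)

Var1 <- "gear"
Var2 <- "cyl"

# Per-operation grouping: the result carries no grouping.
by_result <- mtcars |>
  summarise(Mean_mpg = mean(mpg), .by = all_of(c(Var1, Var2)))

# Persistent grouping: summarise() peels off only the last grouping variable,
# so the result is still grouped by Var1 unless .groups = "drop" is supplied.
grouped_result <- mtcars |>
  group_by(across(all_of(c(Var1, Var2)))) |>
  summarise(Mean_mpg = mean(mpg))

is_grouped_df(by_result)    # FALSE
group_vars(grouped_result)  # "gear"
```

If downstream code relies on the leftover grouping, it would need an explicit group_by() after the .by version.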

G. Grothendieck · Answer

Have revised this after a commenter pointed out a problem.

My understanding is that group_by_ is deprecated so it is going away. group_by_at is superseded (not deprecated) so it is not going away but less preferred by the dplyr authors.

The following do not produce errors or warnings, even though the options statement below forces lifecycle warnings to always be shown.

library(dplyr)
packageVersion("dplyr")      # 1.1.4
packageVersion("tidyselect") # 1.2.1
options(lifecycle_verbosity = "warning")  # force warnings

Var1 <- "cyl"; Var2 <- "gear"
mtcars %>% group_by(pick(any_of(c(Var1, Var2))))
mtcars %>% group_by(.data[[Var1]], .data[[Var2]])

If the code is in a function then

f1 <- function(data, Var1, Var2) data %>% group_by(pick(any_of(c(Var1, Var2))))
Var1 <- "cyl"; Var2 <- "gear"; f1(mtcars, Var1, Var2)

f2 <- function(data, Var1, Var2) data %>% group_by(.data[[Var1]], .data[[Var2]])
Var1 <- "cyl"; Var2 <- "gear"; f2(mtcars, Var1, Var2)
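
f1 and f2 fix the number of grouping variables at two. If the legacy code passes a varying number of names, one possible variation is to accept a single character vector. This is only a sketch: the function name mean_mpg_by and its arguments are hypothetical, and it uses all_of() rather than any_of() so that a misspelled name raises an error.

```r
library(dplyr)

# Hypothetical helper: group by any number of column names held in a character vector.
mean_mpg_by <- function(data, vars) {
  data %>%
    group_by(pick(all_of(vars))) %>%          # tidy-select inside pick()
    summarise(Mean_mpg = mean(mpg), .groups = "drop")
}

two_vars <- mean_mpg_by(mtcars, c("cyl", "gear"))
one_var  <- mean_mpg_by(mtcars, "cyl")
```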

The key thing to understand is that pick() uses tidy-select for its arguments, whereas group_by() itself uses data masking, which is different from tidy-select. It is worth reading up on this, as all verbs that use tidy-select work the same way. Also see the Programming with dplyr vignette.
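
On the edge cases asked about in the question, one place where the forms differ is a name that is not a column of the data. The sketch below uses a deliberately invalid name (not_a_column, chosen for illustration) to compare any_of(), all_of() and .data[[ ]]:

```r
library(dplyr)

vars <- c("cyl", "not_a_column")  # the second name is deliberately absent from mtcars

# any_of() silently ignores names that are missing from the data.
lenient <- mtcars %>% group_by(pick(any_of(vars)))
group_vars(lenient)  # "cyl"

# all_of() treats a missing name as an error.
strict <- tryCatch(
  mtcars %>% group_by(pick(all_of(vars))),
  error = function(e) "error"
)

# .data[[ ]] is data masking rather than tidy-select; it also errors on a missing column.
masked <- tryCatch(
  mtcars %>% group_by(.data[[vars[2]]]),
  error = function(e) "error"
)
```

So any_of() is the more forgiving choice, while all_of() and .data[[ ]] fail loudly, which may be preferable in production code where a missing grouping column indicates a bug.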

Also, mutate, summarise, reframe, filter and slice each support a .by= tidy-select argument, and the slice_* functions support a similar by= (no dot) tidy-select argument. These can be used in place of group_by(pick(...)) for a single statement. However, note that the result of .by=... and group_by(pick(...)) may not be 100% identical even if the ... are the same, because the order of the output rows may differ.
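
The row-order difference can be seen with mtcars. In this sketch (object names are illustrative), .by keeps groups in order of first appearance in the data, while group_by() sorts the group keys; after an explicit arrange() the summarised values match:

```r
library(dplyr)

vars <- c("gear", "cyl")

by_result <- mtcars %>%
  summarise(Mean_mpg = mean(mpg), .by = all_of(vars))

grouped_result <- mtcars %>%
  group_by(pick(all_of(vars))) %>%
  summarise(Mean_mpg = mean(mpg), .groups = "drop")

# First group differs: order of appearance versus sorted keys.
head(by_result, 1)       # gear 4, cyl 6
head(grouped_result, 1)  # gear 3, cyl 4

# Sorting the .by result by the grouping columns lines the two up again.
sorted_by <- by_result %>% arrange(across(all_of(vars)))
all.equal(sorted_by$Mean_mpg, grouped_result$Mean_mpg)
```

Code that depends on row order, for example by indexing rows positionally, would therefore need an explicit arrange() after switching to .by.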
