dplyr group by colnames described as vector of strings

Tags: r, dplyr

I'm trying to group_by multiple columns in my data frame. I can't write out every single column name in the group_by call, so I want to pass the column names as a character vector, like so:

cols <- colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]
mtcars %>% filter(disp < 160) %>% group_by(cols) %>% summarise(n = n())

This returns an error:

Error in mutate_impl(.data, dots) : 
  Column `mtcars[colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]]` must be length 12 (the number of rows) or one, not 7

I definitely want to use a dplyr function to do this, but I can't figure out how.

conv3d asked Dec 20 '17 18:12



2 Answers

You can use group_by_at, which accepts a character vector of column names as the grouping variables:

mtcars %>% 
    filter(disp < 160) %>% 
    group_by_at(cols) %>% 
    summarise(n = n())
# A tibble: 12 x 8
# Groups:   mpg, cyl, disp, drat, qsec, gear [?]
#     mpg   cyl  disp  drat  qsec  gear  carb     n
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1  19.7     6 145.0  3.62 15.50     5     6     1
# 2  21.4     4 121.0  4.11 18.60     4     2     1
# 3  21.5     4 120.1  3.70 20.01     3     1     1
# 4  22.8     4 108.0  3.85 18.61     4     1     1
# ...

Or you can move the column selection inside group_by_at by wrapping a select helper such as matches() in vars():

mtcars %>% 
    filter(disp < 160) %>% 
    group_by_at(vars(matches('[a-z]{3,}$'))) %>% 
    summarise(n = n())

# A tibble: 12 x 8
# Groups:   mpg, cyl, disp, drat, qsec, gear [?]
#     mpg   cyl  disp  drat  qsec  gear  carb     n
#   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
# 1  19.7     6 145.0  3.62 15.50     5     6     1
# 2  21.4     4 121.0  4.11 18.60     4     2     1
# 3  21.5     4 120.1  3.70 20.01     3     1     1
# 4  22.8     4 108.0  3.85 18.61     4     1     1
# ...
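
As a quick sanity check, the two variants above can be compared directly. This is only a sketch, assuming dplyr is installed and that group_by_at is still available in your version; the names by_vector and by_helper are made up for illustration. One thing to keep in mind: matches() ignores case by default while grep() does not, so the two selections agree here only because every mtcars column name is lowercase.

```r
library(dplyr)

# Character vector of column names ending in three or more lowercase letters
cols <- colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]

# Variant 1: pass the character vector to group_by_at()
by_vector <- mtcars %>%
  filter(disp < 160) %>%
  group_by_at(cols) %>%
  summarise(n = n())

# Variant 2: select the same columns with a helper inside vars()
by_helper <- mtcars %>%
  filter(disp < 160) %>%
  group_by_at(vars(matches("[a-z]{3,}$"))) %>%
  summarise(n = n())

# Compare the two results
all.equal(by_vector, by_helper)
```

If the column names had mixed case, passing ignore.case = FALSE to matches() should bring it in line with the grep() call.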
Psidom answered Oct 01 '22 07:10


I believe group_by_at has now been superseded by a combination of group_by and across. summarise also has an experimental .groups argument that lets you choose how the grouping is handled in the summarised result. Here is an alternative to consider:

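library(dplyr)

# Columns whose names end in three or more lowercase letters
cols <- colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]

# Older approach: group_by_at() with a character vector
original <- mtcars %>% 
  filter(disp < 160) %>% 
  group_by_at(cols) %>% 
  summarise(n = n())

# Newer approach: group_by() + across(); all_of() selects the columns named in cols
superseded <- mtcars %>%
  filter(disp < 160) %>%
  group_by(across(all_of(cols))) %>%
  summarise(n = n(), .groups = 'drop_last')

# Check that both approaches give the same result
all.equal(original, superseded)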

Here is a blog post that goes into more detail about using the across function: https://www.tidyverse.org/blog/2020/04/dplyr-1-0-0-colwise/
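
To make the effect of .groups more concrete, here is a small sketch, assuming dplyr 1.0.0 or later. It runs the same grouped summary with three of the documented .groups values and inspects the grouping that remains. The variable names (grouped, last_dropped, all_dropped, all_kept) are made up for illustration.

```r
library(dplyr)

cols <- colnames(mtcars)[grep("[a-z]{3,}$", colnames(mtcars))]

# Group once, then summarise with different .groups settings
grouped <- mtcars %>%
  filter(disp < 160) %>%
  group_by(across(all_of(cols)))

last_dropped <- grouped %>% summarise(n = n(), .groups = "drop_last")
all_dropped  <- grouped %>% summarise(n = n(), .groups = "drop")
all_kept     <- grouped %>% summarise(n = n(), .groups = "keep")

group_vars(last_dropped)  # every column in cols except the last one
group_vars(all_dropped)   # no grouping variables left
group_vars(all_kept)      # the same columns as cols
```

"drop_last" mirrors what summarise did before the argument existed, which is why the earlier output still lists six grouping columns.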

Harrison Jones answered Oct 01 '22 07:10