Group by multiple columns in dplyr, using string vector input

To write the code out in full, here's an update on Hadley's answer with the new syntax:

library(dplyr)

df <-  data.frame(
    asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# Columns you want to group by
grp_cols <- names(df)[-3]

# Convert character vector to list of symbols
dots <- lapply(grp_cols, as.symbol)

# Perform frequency counts
df %>%
    group_by_(.dots=dots) %>%
    summarise(n = n())

output:

Source: local data frame [9 x 3]
Groups: asihckhdoydk

  asihckhdoydk a30mvxigxkgh  n
1            A            A 10
2            A            B 10
3            A            C 13
4            B            A 14
5            B            B 10
6            B            C 12
7            C            A  9
8            C            B 12
9            C            C 10

Since this question was posted, dplyr added scoped versions of group_by (group_by_at, group_by_if and group_by_all). These let you use the same selection helpers you would use with select, like so:

data = data.frame(
    asihckhdoydkhxiydfgfTgdsx = sample(LETTERS[1:3], 100, replace=TRUE),
    a30mvxigxkghc5cdsvxvyv0ja = sample(LETTERS[1:3], 100, replace=TRUE),
    value = rnorm(100)
)

# get the columns we want to average within
columns = names(data)[-3]

library(dplyr)
df1 <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value))

# compare with plyr for reference
df2 <- plyr::ddply(data, columns, plyr::summarize, value=mean(value))
table(df1 == df2, useNA = 'ifany')
## TRUE 
##  27

The output from your example question is as expected (see comparison to plyr above and output below):

# A tibble: 9 x 3
# Groups:   asihckhdoydkhxiydfgfTgdsx [?]
  asihckhdoydkhxiydfgfTgdsx a30mvxigxkghc5cdsvxvyv0ja       Value
                     <fctr>                    <fctr>       <dbl>
1                         A                         A  0.04095002
2                         A                         B  0.24943935
3                         A                         C -0.25783892
4                         B                         A  0.15161805
5                         B                         B  0.27189974
6                         B                         C  0.20858897
7                         C                         A  0.19502221
8                         C                         B  0.56837548
9                         C                         C -0.22682998

Note that since dplyr::summarize only strips off one layer of grouping at a time, the resulting tibble is still grouped (here by the first column), which can sometimes catch people by surprise later down the line. If you want to be absolutely safe from unexpected grouping behavior, you can always add %>% ungroup() to your pipeline after you summarize.
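As a minimal sketch of that last point, the pipeline above can be ended with ungroup() and checked with is_grouped_df(). The short column names g1 and g2 and the seed are made up for illustration; they are not from the question.

```r
library(dplyr)

set.seed(1)  # hypothetical seed, only so the sketch is reproducible

# Stand-in data with illustrative column names (assumption, not the OP's data)
data <- data.frame(
  g1 = sample(LETTERS[1:3], 100, replace = TRUE),
  g2 = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)
columns <- c("g1", "g2")

# Same pipeline as above, with the leftover grouping dropped explicitly
res <- data %>%
  group_by_at(vars(one_of(columns))) %>%
  summarize(Value = mean(value)) %>%
  ungroup()

# FALSE here means no grouping survives into later steps
is_grouped_df(res)
```

Without the final ungroup(), the result would still be grouped by g1, so a later mutate() or summarize() would silently operate per group.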

Support for this in dplyr is currently pretty weak. Eventually I think the syntax will be something like:

df %.% group_by(.groups = c("asdfgfTgdsx", "asdfk30v0ja"))

But that probably won't be there for a while (because I need to think through all the consequences).

In the meantime, you can use regroup(), which takes a list of symbols:

library(dplyr)

df <-  data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace=TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace=TRUE),
  value = rnorm(100)
)

df %.%
  regroup(list(quote(asihckhdoydk), quote(a30mvxigxkgh))) %.%
  summarise(n = n())

If you have a character vector of column names, you can convert them to the right structure with lapply() and as.symbol():

vars <- setdiff(names(df), "value")
vars2 <- lapply(vars, as.symbol)

df %.% regroup(vars2) %.% summarise(n = n())
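regroup() and %.% were dropped from dplyr in later releases, so the snippet above no longer runs on a current install. The same idea of turning strings into symbols can be sketched with rlang::syms() and the !!! splice operator. This is an assumption about how one might port the answer, not the original author's code, and the .groups argument needs dplyr 1.0.0 or later.

```r
library(dplyr)

# Same shape of data as in the answer above
df <- data.frame(
  asihckhdoydk = sample(LETTERS[1:3], 100, replace = TRUE),
  a30mvxigxkgh = sample(LETTERS[1:3], 100, replace = TRUE),
  value = rnorm(100)
)

# Character vector of grouping columns
vars <- setdiff(names(df), "value")

# syms() converts the strings to a list of symbols; !!! splices them
# into group_by() as if they had been typed out by hand
counts <- df %>%
  group_by(!!!rlang::syms(vars)) %>%
  summarise(n = n(), .groups = "drop")

counts
```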

String specification of columns in dplyr is now supported through variants of the dplyr functions whose names end in an underscore. For example, corresponding to the group_by function there is a group_by_ function that takes string arguments. The dplyr vignette on non-standard evaluation describes the syntax of these functions in detail.

The following snippet cleanly solves the problem that @sharoz originally posed (note the need to write out the .dots argument):

# Given data and columns from the OP

data %>%
    group_by_(.dots = columns) %>%
    summarise(Value = mean(value))

(Note that dplyr now uses the %>% operator; %.% is deprecated.)
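The underscore variants were themselves deprecated in later dplyr releases. For readers on dplyr 1.0.0 or newer, a rough equivalent is to wrap the character vector in all_of() inside across(). The sketch below uses the built-in mtcars data and the column names cyl, gear and disp purely for illustration; they are not the OP's data.

```r
library(dplyr)

# Grouping columns supplied as strings (illustrative choice)
columns <- c("cyl", "gear")

# across(all_of(...)) selects the grouping columns by name;
# .groups = "drop" returns an ungrouped result
res <- mtcars %>%
  group_by(across(all_of(columns))) %>%
  summarise(Value = mean(disp), .groups = "drop")

res
```

all_of() raises an error if any name in columns is missing from the data, which is usually what you want when the names come from user input.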

Until dplyr has full support for string arguments, perhaps this gist is useful:

https://gist.github.com/skranz/9681509

It contains a bunch of wrapper functions like s_group_by, s_mutate, s_filter, etc. that take string arguments. You can mix them with the normal dplyr functions. For example:

cols = c("cyl","gear")
mtcars %.%
  s_group_by(cols) %.%  
  s_summarise("avdisp=mean(disp), max(disp)") %.%
  arrange(avdisp)
