
dplyr summarise: Equivalent of ".drop=FALSE" to keep groups with zero length in output

Tags: r, dplyr, plyr, tidyr

When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?

Here's an example with fake data.

    library(dplyr)

    df = data.frame(a=rep(1:3,4), b=rep(1:2,6))

    # Now add an extra level to df$b that has no corresponding value in df$a
    df$b = factor(df$b, levels=1:3)

    # Summarise with plyr, keeping categories with a count of zero
    plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)

      b    count_a
    1 1    6
    2 2    6
    3 3    0

    # Now try it with dplyr
    df %>%
      group_by(b) %>%
      summarise(count_a=length(a), .drop=FALSE)

      b     count_a .drop
    1 1     6       FALSE
    2 2     6       FALSE

Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?

eipi10 asked Mar 20 '14 03:03


2 Answers

The issue is still open, but in the meantime, especially since b is already a factor, you can use complete() from tidyr to get what you might be looking for:

    library(tidyr)

    df %>%
      group_by(b) %>%
      summarise(count_a=length(a)) %>%
      complete(b)
    # Source: local data frame [3 x 2]
    #
    #        b count_a
    #   (fctr)   (int)
    # 1      1       6
    # 2      2       6
    # 3      3      NA

If you want the replacement value to be zero instead of NA, specify that with fill:

    df %>%
      group_by(b) %>%
      summarise(count_a=length(a)) %>%
      complete(b, fill = list(count_a = 0))
    # Source: local data frame [3 x 2]
    #
    #        b count_a
    #   (fctr)   (dbl)
    # 1      1       6
    # 2      2       6
    # 3      3       0
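A side note on the fill approach: in the output shown here, count_a comes back as a double column once 0 is used as the fill value. A minimal sketch, assuming you want the counts to stay integer, is to fill with the integer literal 0L. Newer tidyr releases may cast the fill value to the column's type anyway, so treat this as a way of being explicit rather than a required fix. The data frame is the one from the question; the name `res` is only for illustration.

```r
library(dplyr)
library(tidyr)

# Same data as in the question: b has an unused level 3
df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

res <- df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  complete(b, fill = list(count_a = 0L))  # integer zero rather than 0 (double)

res
```

Because b is a factor, complete(b) expands over all of its levels, which is why the row for the unused level 3 appears at all.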
A5C1D2H2I1M1N2O1R2T1 answered Oct 01 '22 14:10


Since dplyr 0.8, group_by() has a .drop argument that does just what you asked for:

    library(dplyr)

    df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
    df$b = factor(df$b, levels=1:3)

    df %>%
      group_by(b, .drop=FALSE) %>%
      summarise(count_a=length(a))

    #> # A tibble: 3 x 2
    #>   b     count_a
    #>   <fct>   <int>
    #> 1 1           6
    #> 2 2           6
    #> 3 3           0
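Two related points may be worth a quick sketch, using the same data as the question. First, if all you need is a count, count() also accepts .drop and passes it on to group_by(). Second, with .drop = FALSE every summary function is evaluated on a zero-length vector for the empty group, so something like mean() yields NaN there instead of 0. The names `counts` and `means` are only for illustration.

```r
library(dplyr)

# Same data as in the question: b has an unused level 3
df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

# Counting only: count() forwards .drop to group_by()
counts <- df %>% count(b, .drop = FALSE)

# Other summaries see zero-length input for the empty group
means <- df %>%
  group_by(b, .drop = FALSE) %>%
  summarise(n = n(), mean_a = mean(a))

counts
means
```

If a NaN (or similar) for empty groups is not what you want, you would need to handle that case yourself, for example by replacing it after summarising.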

One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:

    library(dplyr)
    data(iris)

    # Add an additional level to Species
    iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

    # Species is a factor and empty groups are included in the output
    iris %>% group_by(Species, .drop=FALSE) %>% tally

    #>   Species         n
    #> 1 setosa         50
    #> 2 versicolor     50
    #> 3 virginica      50
    #> 4 empty_level     0

    # Add character column
    iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

    # Empty groups involving combinations of Species and group2 are not included in output
    iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

    #>   Species     group2     n
    #> 1 setosa      A         25
    #> 2 setosa      B         25
    #> 3 versicolor  A         25
    #> 4 versicolor  B         25
    #> 5 virginica   B         25
    #> 6 virginica   C         25
    #> 7 empty_level <NA>       0

    # Turn group2 into a factor
    iris$group2 = factor(iris$group2)

    # Now all possible combinations of Species and group2 are included in the output,
    #  whether present in the data or not
    iris %>% group_by(Species, group2, .drop=FALSE) %>% tally

    #>    Species     group2     n
    #>  1 setosa      A         25
    #>  2 setosa      B         25
    #>  3 setosa      C          0
    #>  4 versicolor  A         25
    #>  5 versicolor  B         25
    #>  6 versicolor  C          0
    #>  7 virginica   A          0
    #>  8 virginica   B         25
    #>  9 virginica   C         25
    #> 10 empty_level A          0
    #> 11 empty_level B          0
    #> 12 empty_level C          0

Created on 2019-03-13 by the reprex package (v0.2.1)
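If the factor requirement is easy to forget, one possible pattern is to convert character grouping columns to factors inside the pipeline, just before grouping. The sketch below uses a small made-up data frame (`toy`, `g1`, `g2` and `counts` are hypothetical names, not from the examples above) and is only one way to do it.

```r
library(dplyr)

# Hypothetical toy data: g1 is a factor with an unused level "z",
# g2 is a plain character column
toy <- data.frame(
  g1 = factor(c("x", "x", "y"), levels = c("x", "y", "z")),
  g2 = c("A", "B", "B"),
  stringsAsFactors = FALSE
)

counts <- toy %>%
  mutate(g2 = factor(g2)) %>%            # make g2's levels explicit
  group_by(g1, g2, .drop = FALSE) %>%
  tally()

counts
```

Note that factor(g2) only knows about the values present in the data. If a category can be missing from the data entirely, you would have to pass the full set via the levels argument yourself.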
Moody_Mudskipper answered Oct 01 '22 12:10