When using summarise with plyr's ddply function, empty categories are dropped by default. You can change this behavior by adding .drop = FALSE. However, this doesn't work when using summarise with dplyr. Is there another way to keep empty categories in the result?
Here's an example with fake data.
    library(dplyr)

    df = data.frame(a=rep(1:3,4), b=rep(1:2,6))

    # Now add an extra level to df$b that has no corresponding value in df$a
    df$b = factor(df$b, levels=1:3)

    # Summarise with plyr, keeping categories with a count of zero
    plyr::ddply(df, "b", summarise, count_a=length(a), .drop=FALSE)
    #   b count_a
    # 1 1       6
    # 2 2       6
    # 3 3       0

    # Now try it with dplyr
    # (%.% was the pipe in early dplyr releases; later versions use %>%)
    df %.%
      group_by(b) %.%
      summarise(count_a=length(a), .drop=FALSE)
    #   b count_a .drop
    # 1 1       6 FALSE
    # 2 2       6 FALSE
Not exactly what I was hoping for. Is there a dplyr method for achieving the same result as .drop=FALSE in plyr?
The issue is still open, but in the meantime, especially since your data are already factored, you can use complete from "tidyr" to get what you might be looking for:
    library(tidyr)

    df %>%
      group_by(b) %>%
      summarise(count_a=length(a)) %>%
      complete(b)
    # Source: local data frame [3 x 2]
    #
    #        b count_a
    #   (fctr)   (int)
    # 1      1       6
    # 2      2       6
    # 3      3      NA
If you want the replacement value to be zero, you need to specify that with fill:
    df %>%
      group_by(b) %>%
      summarise(count_a=length(a)) %>%
      complete(b, fill = list(count_a = 0))
    # Source: local data frame [3 x 2]
    #
    #        b count_a
    #   (fctr)   (dbl)
    # 1      1       6
    # 2      2       6
    # 3      3       0
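If the grouping column is not a factor, complete has no levels to read the missing categories from, so they have to be supplied by hand. Here is a minimal sketch of that case. It assumes b was left as a plain integer column and that the valid categories are known to be 1:3; both are assumptions made for illustration, not part of the original question.

```r
library(dplyr)
library(tidyr)

# Same fake data as in the question, but b stays an integer column,
# so nothing in the data records that a category 3 exists.
df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))

counts <- df %>%
  group_by(b) %>%
  summarise(count_a = length(a)) %>%
  # Spell out the full set of categories (assumed to be 1:3 here)
  # and fill the count for the missing one with an integer zero.
  complete(b = 1:3, fill = list(count_a = 0L))

counts
```

The result should have one row per value in 1:3, with a count of zero for b = 3. Using 0L rather than 0 as the fill value is intended to keep count_a an integer column.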
As of dplyr 0.8, group_by has a .drop argument that does just what you asked for:
    df = data.frame(a=rep(1:3,4), b=rep(1:2,6))
    df$b = factor(df$b, levels=1:3)

    df %>%
      group_by(b, .drop=FALSE) %>%
      summarise(count_a=length(a))
    #> # A tibble: 3 x 2
    #>   b     count_a
    #>   <fct>   <int>
    #> 1 1           6
    #> 2 2           6
    #> 3 3           0
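When the summary is just a row count, as it is here, the same thing can probably be written more compactly with count, which also accepts .drop in dplyr 0.8 and later. The sketch below assumes that a plain row count per group is all that is needed; the column name count_a is set through the name argument only to match the question.

```r
library(dplyr)

# Same fake data as in the question, with an unused third level on b
df <- data.frame(a = rep(1:3, 4), b = rep(1:2, 6))
df$b <- factor(df$b, levels = 1:3)

# count() groups by b and tallies the rows; .drop = FALSE keeps the empty level
counts <- df %>%
  count(b, .drop = FALSE, name = "count_a")

counts
```

This only replaces summarise when the statistic is a count. For other summaries (means, sums, and so on) the group_by(b, .drop=FALSE) form is still the one to use.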
One additional note to go with @Moody_Mudskipper's answer: Using .drop=FALSE can give potentially unexpected results when one or more grouping variables are not coded as factors. See examples below:
    library(dplyr)
    data(iris)

    # Add an additional level to Species
    iris$Species = factor(iris$Species, levels=c(levels(iris$Species), "empty_level"))

    # Species is a factor and empty groups are included in the output
    iris %>% group_by(Species, .drop=FALSE) %>% tally
    #>       Species  n
    #> 1      setosa 50
    #> 2  versicolor 50
    #> 3   virginica 50
    #> 4 empty_level  0

    # Add character column
    iris$group2 = c(rep(c("A","B"), 50), rep(c("B","C"), each=25))

    # Empty groups involving combinations of Species and group2 are not included in output
    iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
    #>       Species group2  n
    #> 1      setosa      A 25
    #> 2      setosa      B 25
    #> 3  versicolor      A 25
    #> 4  versicolor      B 25
    #> 5   virginica      B 25
    #> 6   virginica      C 25
    #> 7 empty_level   <NA>  0

    # Turn group2 into a factor
    iris$group2 = factor(iris$group2)

    # Now all possible combinations of Species and group2 are included in the output,
    # whether present in the data or not
    iris %>% group_by(Species, group2, .drop=FALSE) %>% tally
    #>        Species group2  n
    #> 1       setosa      A 25
    #> 2       setosa      B 25
    #> 3       setosa      C  0
    #> 4   versicolor      A 25
    #> 5   versicolor      B 25
    #> 6   versicolor      C  0
    #> 7    virginica      A  0
    #> 8    virginica      B 25
    #> 9    virginica      C 25
    #> 10 empty_level      A  0
    #> 11 empty_level      B  0
    #> 12 empty_level      C  0

Created on 2019-03-13 by the reprex package (v0.2.1)