Proper idiom for adding zero count rows in tidyr/dplyr

Tags: r, dplyr, tidyr

Suppose I have some count data that looks like this:

    library(tidyr)
    library(dplyr)

    X.raw <- data.frame(
      x = as.factor(c("A", "A", "A", "B", "B", "B")),
      y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
      z = 1:6
    )
    X.raw
    #   x  y z
    # 1 A  i 1
    # 2 A ii 2
    # 3 A ii 3
    # 4 B  i 4
    # 5 B  i 5
    # 6 B  i 6

I'd like to tidy and summarise like this:

    X.tidy <- X.raw %>%
      group_by(x, y) %>%
      summarise(count = sum(z))
    X.tidy
    # Source: local data frame [3 x 3]
    # Groups: x
    #
    #   x  y count
    # 1 A  i     1
    # 2 A ii     5
    # 3 B  i    15

I know that for x == "B" and y == "ii" we have an observed count of zero rather than a missing value; that is, the field worker was actually there, but because there wasn't a positive count, no row was entered into the raw data. I can add the zero count explicitly by doing this:

    X.fill <- X.tidy %>%
      spread(y, count, fill = 0) %>%
      gather(y, count, -x)
    X.fill
    # Source: local data frame [4 x 3]
    #
    #   x  y count
    # 1 A  i     1
    # 2 B  i    15
    # 3 A ii     5
    # 4 B ii     0

But that seems a little bit of a roundabout way of doing things. Is there a cleaner idiom for this?

Just to clarify: My code already does what I need it to do, using spread then gather, so what I'm interested in is finding a more direct route within tidyr and dplyr.

asked Sep 21 '14 at 05:09 by pete


2 Answers

Since dplyr 0.8 you can do this by setting the argument .drop = FALSE in group_by, which keeps groups for factor-level combinations that do not occur in the data:

    X.tidy <- X.raw %>%
      group_by(x, y, .drop = FALSE) %>%
      summarise(count = sum(z))
    X.tidy
    # # A tibble: 4 x 3
    # # Groups:   x [2]
    #   x     y     count
    #   <fct> <fct> <int>
    # 1 A     i         1
    # 2 A     ii        5
    # 3 B     i        15
    # 4 B     ii        0
answered Oct 11 '22 at 23:10 by Moody_Mudskipper
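
A note on the .drop = FALSE approach above: as far as I know, it only preserves empty groups for grouping columns that are factors, since the full set of levels has to come from somewhere. The following is a minimal sketch, not part of the original answer, assuming the data arrive with character columns (the name X.chr is hypothetical) and a dplyr version that supports the .groups argument of summarise (1.0 or later). Converting to factors first lets the zero-count row appear:

```r
library(dplyr)

# Hypothetical variant of the question's data with character columns
X.chr <- data.frame(
  x = c("A", "A", "A", "B", "B", "B"),
  y = c("i", "ii", "ii", "i", "i", "i"),
  z = 1:6,
  stringsAsFactors = FALSE
)

X.counts <- X.chr %>%
  # .drop = FALSE needs factor levels to know which groups are "missing"
  mutate(x = factor(x), y = factor(y)) %>%
  group_by(x, y, .drop = FALSE) %>%
  # sum() over an empty group returns 0, giving the explicit zero count
  summarise(count = sum(z), .groups = "drop")

X.counts
```

If a level never occurs anywhere in the data, it would have to be supplied explicitly, for example with factor(y, levels = c("i", "ii")).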


The complete function from tidyr is made for just this situation.

From the docs:

This is a wrapper around expand(), left_join() and replace_na that's useful for completing missing combinations of data.
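
To make the quoted description concrete, here is a rough sketch of the three steps written out by hand. It is only an illustration of the idea in the quote, not tidyr's actual implementation; the summarised data are rebuilt inline so the block is self-contained, and the name X.manual is hypothetical.

```r
library(dplyr)
library(tidyr)

# The summarised counts from the question, rebuilt here (ungrouped)
X.tidy <- data.frame(
  x = factor(c("A", "A", "B")),
  y = factor(c("i", "ii", "i")),
  count = c(1L, 5L, 15L)
)

X.manual <- X.tidy %>%
  expand(x, y) %>%                        # every combination of x and y
  left_join(X.tidy, by = c("x", "y")) %>% # attach the observed counts
  replace_na(list(count = 0L))            # unobserved combinations become 0

X.manual
```

In practice complete(x, y, fill = list(count = 0)) expresses the same intent in one call.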

You could use it in two ways. First, you could use it on the original dataset before summarizing, "completing" the dataset with all combinations of x and y and filling z with 0 (alternatively, you could keep the default NA fill and pass na.rm = TRUE to sum).

    X.raw %>%
      complete(x, y, fill = list(z = 0)) %>%
      group_by(x, y) %>%
      summarise(count = sum(z))
    # Source: local data frame [4 x 3]
    # Groups: x [?]
    #
    #        x      y count
    #   <fctr> <fctr> <dbl>
    # 1      A      i     1
    # 2      A     ii     5
    # 3      B      i    15
    # 4      B     ii     0
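
The alternative mentioned in parentheses, keeping the default NA fill and dropping the NAs inside sum, might look like the sketch below. The data are repeated so the block runs on its own; the name X.na and the .groups argument (dplyr 1.0 or later) are assumptions, not part of the original answer.

```r
library(dplyr)
library(tidyr)

X.raw <- data.frame(
  x = as.factor(c("A", "A", "A", "B", "B", "B")),
  y = as.factor(c("i", "ii", "ii", "i", "i", "i")),
  z = 1:6
)

X.na <- X.raw %>%
  complete(x, y) %>%      # missing combinations get z = NA
  group_by(x, y) %>%
  # na.rm = TRUE turns the all-NA group into a sum of 0
  summarise(count = sum(z, na.rm = TRUE), .groups = "drop")

X.na
```

One trade-off: na.rm = TRUE would also silently ignore any genuine NA values already present in z, so the explicit fill = list(z = 0) version may be the safer choice when the raw data can contain real missing counts.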

You can also use complete on your pre-summarized dataset. Note that complete respects grouping. X.tidy is grouped, so you can either ungroup and complete the dataset by x and y or just list the variable you want completed within each group - in this case, y.

    # Complete after ungrouping
    X.tidy %>%
      ungroup() %>%
      complete(x, y, fill = list(count = 0))

    # Complete within grouping
    X.tidy %>%
      complete(y, fill = list(count = 0))

The result is the same for each option:

    # Source: local data frame [4 x 3]
    #
    #        x      y count
    #   <fctr> <fctr> <dbl>
    # 1      A      i     1
    # 2      A     ii     5
    # 3      B      i    15
    # 4      B     ii     0
answered Oct 11 '22 at 23:10 by aosmith