
R: making dplyr's group_by() consider empty groups as well

Tags: r, group-by, dplyr

Let's consider the following data frame:

set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

The contingency table is as follows:

cont_tab <- table(data$col1, data$col2, dnn = c("col1", "col2"))

cont_tab

    col2
col1 A B C
   A 4 0 0
   B 1 3 0
   C 1 0 3

As you can see, some pairs didn't occur: (A,B), (A,C), (B,C), (C,B). The end goal of my analysis is to list all of the pairs (in this case 9) and show a statistic for each of them. While using the dplyr::group_by() function I hit a limitation: it considers only existing pairs (pairs that occurred at least once):

library(dplyr)

data %>%
  group_by(col1, col2) %>%
  summarize(stat = sum(val2) - sum(val1))

# A tibble: 5 x 3
# Groups:   col1 [?]
  col1  col2   stat
  <fct> <fct> <dbl>
1 A     A      58.1
2 B     A     -16.4
3 B     B      17.0
4 C     A     -12.9
5 C     C     -41.9

The output I have in mind has 9 rows (4 of which have stat equal to 0). Is it doable in dplyr?

EDIT: Sorry for being too vague at the beginning. The real problem is more complex than counting the number of times a particular pair occurs. I added the new data in order to make the real problem more visible.

balkon16 asked Oct 30 '25 22:10


1 Answer

It is much easier to add spread() from tidyr, which gives the same result as table(), with the missing pairs filled in as 0:

library(dplyr)
library(tidyr)
count(data, col1, col2) %>% 
      spread(col2, n, fill = 0)
# A tibble: 3 x 4
# Groups:   col1 [3]
#  col1      A     B     C
#  <fct> <dbl> <dbl> <dbl>
#1 A         4     0     0
#2 B         1     3     0
#3 C         1     0     3

NOTE: count() replaces the group_by()/summarise() step here.
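
For readers on a newer tidyr, where spread() is superseded, roughly the same wide table can probably be built with pivot_wider(). This is only a sketch, assuming tidyr >= 1.0.0; the name `wide` is mine, not from the original answer.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# Count the observed pairs, then widen; cells for pairs that never
# occurred are filled with 0 via values_fill
wide <- count(data, col1, col2) %>%
  pivot_wider(names_from = col2, values_from = n,
              values_fill = list(n = 0L))

print(wide)
```

The named-list form of values_fill is used because it should be accepted by both older and newer tidyr releases.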

As @divibisan suggested, if the OP wants the long format, add gather() at the end:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = n()) %>%
   spread(col2, stat, fill = 0) %>%
   gather(col2, stat, A:C)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2   stat
#  <fct> <chr> <dbl>
#1 A     A         4
#2 B     A         1
#3 C     A         1
#4 A     B         0
#5 B     B         3
#6 C     B         0
#7 A     C         0
#8 B     C         0
#9 C     C         3
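
If the wide-then-long round trip feels indirect, tidyr::complete() may be a shorter way to get the same long table. A minimal sketch, assuming col1 and col2 are factors (as in the question) and a dplyr version where count() has the `name` argument (>= 0.8.0, as far as I know); `long` is just an illustrative name.

```r
library(dplyr)
library(tidyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# count() returns an ungrouped result here, so complete() can expand
# both columns to every combination of their factor levels;
# combinations that were missing get stat = 0
long <- data %>%
  count(col1, col2, name = "stat") %>%
  complete(col1, col2, fill = list(stat = 0L))

print(long)
```

Note that complete() will not expand a column that is still a grouping variable, so a result coming out of group_by()/summarise() would need ungroup() first.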

Update

With the updated data in the OP's post:

data %>%
   group_by(col1, col2) %>%
   summarize(stat = sum(val2) - sum(val1)) %>% 
   spread(col2, stat, fill = 0)  %>% 
   gather(col2, stat, -1)
# A tibble: 9 x 3
# Groups:   col1 [3]
#  col1  col2    stat
#  <fct> <chr>  <dbl>
#1 A     A       7.76
#2 B     A     -20.8 
#3 C     A       6.97
#4 A     B       0   
#5 B     B      28.8 
#6 C     B       0   
#7 A     C       0   
#8 B     C       0   
#9 C     C       9.56
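
Another option that stays entirely in dplyr, as the question asks, might be the `.drop` argument of group_by(). The sketch below assumes dplyr >= 1.0.0 (`.drop` itself appeared in 0.8.0, `.groups` in 1.0.0) and that the grouping columns are factors; `res` is an illustrative name.

```r
library(dplyr)

# Same data as in the question
set.seed(123)
data <- data.frame(col1 = factor(rep(c("A", "B", "C"), 4)),
                   col2 = factor(c(rep(c("A", "B", "C"), 3), c("A", "A", "A"))),
                   val1 = 1:12,
                   val2 = rnorm(12, 10, 15))

# .drop = FALSE keeps groups for factor-level combinations that have
# no rows; sum() over an empty group is 0, so their stat is 0 - 0 = 0
res <- data %>%
  group_by(col1, col2, .drop = FALSE) %>%
  summarise(stat = sum(val2) - sum(val1), .groups = "drop")

print(res)
```

This works out to 0 only because sum() of an empty vector is 0. A statistic such as mean() would give NaN for the empty groups, so those may need to be replaced afterwards.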

akrun answered Nov 02 '25 12:11


