
dplyr::mutate gives x/y = NA, summarise gives x/y = real number

Tags: r, dplyr

I'm working on validating a function that calculates pass rates for a certain criterion in my lab. The mathematics behind this is very simple: given a number of tests that each either passed or failed, what percentage passed?

The data will be provided as a column of values that are either P1 (passed on first test), F1 (failed on first test), P2 or F2 (passed or failed on second test, respectively). I wrote the function passRate below to assist in calculating the pass rates overall (first and second try) and on the first test and second test in isolation.

The quality specialist who set up the parameters for the validation gave me a list of pass and fail counts that I am converting into a vector using the test_vector function below.
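
Since the two helpers are not reproduced here, the following is only a minimal sketch of what they might look like. The signatures, argument names, and the names of the returned elements are assumptions for illustration, not the actual validated code.

```r
# Hypothetical helper: expand pass/fail counts into a vector of result codes,
# e.g. counts of 2, 0, 1, 0 become c("P1", "P1", "P2").
test_vector <- function(P1, F1, P2, F2) {
  rep(c("P1", "F1", "P2", "F2"), times = c(P1, F1, P2, F2))
}

# Hypothetical helper: pass rates (in percent) overall, on the first test,
# and on the second test, computed from a vector of result codes.
passRate <- function(x) {
  p1 <- sum(x == "P1"); f1 <- sum(x == "F1")
  p2 <- sum(x == "P2"); f2 <- sum(x == "F2")
  c(overall = (p1 + p2) / (p1 + f1 + p2 + f2) * 100,
    first   = p1 / (p1 + f1) * 100,
    second  = p2 / (p2 + f2) * 100)
}

# Counts from the third row of Pass: 10 first-test passes, 2 second-test passes
passRate(test_vector(P1 = 10, F1 = 0, P2 = 2, F2 = 0))
```

With these assumed definitions, a set of counts with no second tests at all gives 0/0 for the second-test rate, which R evaluates to NaN rather than NA.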

Everything was looking great until I got to the third row of the Pass data frame, which contains the pass/fail counts from my quality specialist. Instead of returning a second-test pass rate of 100%, it returns NA, but only when I use mutate.

library(dplyr)

Pass <- structure(list(P1 = c(2L, 0L, 10L), 
                       F1 = c(0L, 2L, 0L), 
                       P2 = c(0L, 3L, 2L), 
                       F2 = c(0L, 2L, 0L), 
                       id = 1:3), 
                  .Names = c("P1", "F1", "P2", "F2", "id"), 
                  class = c("tbl_df", "data.frame"), 
                  row.names = c(NA, -3L))

So here's something akin to what I did with mutate.

Pass %>%
  group_by(id) %>%
  mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
         pass_rate1 = P1 / (P1 + F1) * 100,
         pass_rate2 = P2 / (P2 + F2) * 100)

Source: local data frame [3 x 8]
Groups: id [3]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000        100         NA
2     0     2     3     2     2  42.85714          0         60
3    10     0     2     0     3 100.00000        100         NA

Compare with what I get when I use summarise:

Pass %>%
  group_by(id) %>%
  summarise(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
            pass_rate1 = P1 / (P1 + F1) * 100,
            pass_rate2 = P2 / (P2 + F2) * 100)

Source: local data frame [3 x 4]

     id pass_rate pass_rate1 pass_rate2
  (int)     (dbl)      (dbl)      (dbl)
1     1 100.00000        100         NA
2     2  42.85714          0         60
3     3 100.00000        100        100

I would have expected these to return the same results. My guess is that mutate is having problems somewhere because it assumes n rows per group should map to n rows in the result (is it getting confused in calculating n here?), while summarise knows that no matter how many rows it starts with, it's going to end with just 1.

Does anyone have any thoughts on what the mechanics behind this behavior are?
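
For reference, the values I would expect can be cross-checked in base R with no dplyr involved. This is just a sketch: the names counts and expected are made up for illustration, and the data are the same counts as in Pass stored in a plain data.frame.

```r
# Same counts as the Pass data frame, as a plain data.frame (no dplyr involved)
counts <- data.frame(P1 = c(2L, 0L, 10L),
                     F1 = c(0L, 2L, 0L),
                     P2 = c(0L, 3L, 2L),
                     F2 = c(0L, 2L, 0L))

# Vectorised arithmetic: each result uses only the values in its own row
expected <- with(counts, data.frame(
  pass_rate  = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
  pass_rate1 = P1 / (P1 + F1) * 100,
  pass_rate2 = P2 / (P2 + F2) * 100
))

# pass_rate2 is NaN for row 1 (0 / 0) and 100 for row 3 (2 / 2)
expected
```

So the only row where a missing-like value is legitimate is the first one, where no second tests were run; the third row should come out as 100.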

Benjamin asked Oct 20 '22 00:10


1 Answer

This looks to me like some interference between dplyr and plyr. I had the same issue with another unbalanced dataset (so grouping was necessary), where the mutated variable was erroneously NA in exactly the third group. I then reproduced your example at home. First, after

library("dplyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")

I got exactly your results. Then I ran my own script, which loads the package plyr. After the warning about loading plyr after dplyr, the NA in my third group was gone, and your example was computed correctly as well. Here is what I did (I added one more row to see whether the NA stays in the third group):

> Pass <- structure(list(P1 = c(2L, 0L, 10L, 8L), 
+                        F1 = c(0L, 2L, 0L, 4L), 
+                        P2 = c(0L, 3L, 2L, 2L), 
+                        F2 = c(0L, 2L, 0L, 1L), 
+                        id = 1:4), 
+                   .Names = c("P1", "F1", "P2", "F2", "id"), 
+                   class = c("tbl_df", "data.frame"), 
+                   row.names = c(NA, -4L))
> Pass %>%
+     group_by(id) %>%
+     mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000         NA
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000         NA
4     8     4     2     1     4  66.66667   66.66667   66.66667

Then I did:

> library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")
> Pass %>%
+     group_by(id) %>%
+     mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000        NaN
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000  100.00000
4     8     4     2     1     4  66.66667   66.66667   66.66667

I know this is not a satisfying answer, because plyr should NOT be loaded after dplyr, but maybe it helps those who need group_by(id). Alternatively, call plyr::mutate() explicitly; then you can load dplyr after plyr:

> Pass %>%
+     group_by(id) %>%
+     plyr::mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+            pass_rate1 = P1 / (P1 + F1) * 100,
+            pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]

     P1    F1    P2    F2    id pass_rate pass_rate1 pass_rate2
  (int) (int) (int) (int) (int)     (dbl)      (dbl)      (dbl)
1     2     0     0     0     1 100.00000  100.00000        NaN
2     0     2     3     2     2  42.85714    0.00000   60.00000
3    10     0     2     0     3 100.00000  100.00000  100.00000
4     8     4     2     1     4  66.66667   66.66667   66.66667
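
As far as I can tell, plyr::mutate() ignores the grouping and simply evaluates each expression over whole columns. Since every pass rate here uses only values from its own row, the grouping may not be needed for this particular calculation at all. Here is a sketch of the same arithmetic with base R's transform(), which needs neither package; the names Pass_df and result are made up for illustration.

```r
# Same four rows as above, as a plain data.frame
Pass_df <- data.frame(P1 = c(2L, 0L, 10L, 8L),
                      F1 = c(0L, 2L, 0L, 4L),
                      P2 = c(0L, 3L, 2L, 2L),
                      F2 = c(0L, 2L, 0L, 1L),
                      id = 1:4)

# transform() evaluates each expression column-wise against the original data
result <- transform(Pass_df,
                    pass_rate  = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
                    pass_rate1 = P1 / (P1 + F1) * 100,
                    pass_rate2 = P2 / (P2 + F2) * 100)

# Row 1 has pass_rate2 = NaN (0 / 0); row 3 has pass_rate2 = 100
result
```

This does not help when a calculation really depends on the groups (for example, sums within each id), so it is a workaround for this example, not a fix for the grouped mutate.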
Carsten Oppitz answered Oct 21 '22 21:10