I'm working on validating a function to calculate pass rates for a certain criterion in my lab. The mathematics behind this is very simple: given a number of tests that either passed or failed, what percentage passed?
The data will be provided as a column of values that are either P1 (passed on first test), F1 (failed on first test), P2 or F2 (passed or failed on second test, respectively). I wrote the function passRate below to assist in calculating the pass rates overall (first and second try) and on the first test and second test in isolation.
The quality specialist who set up the parameters for the validation gave me a list of pass and fail counts that I am converting into a vector using the test_vector function below.
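The passRate and test_vector functions themselves are not shown in this post. As a rough illustration of the idea only, here is a minimal base R sketch; the names make_test_vector and pass_rate_sketch and their arguments are hypothetical and are not the original functions.

```r
# Hypothetical helper (not the original test_vector): expand pass/fail
# counts into a vector of result codes "P1", "F1", "P2", "F2".
make_test_vector <- function(p1, f1, p2, f2) {
  rep(c("P1", "F1", "P2", "F2"), times = c(p1, f1, p2, f2))
}

# Hypothetical helper (not the original passRate): percentage of passes.
# `test` is NULL (both tests), 1 (first test only) or 2 (second test only).
pass_rate_sketch <- function(x, test = NULL) {
  if (!is.null(test)) {
    # keep only the codes whose second character matches the test number
    x <- x[substr(x, 2, 2) == as.character(test)]
  }
  # mean() of an empty logical vector is NaN, mirroring 0 / 0
  mean(substr(x, 1, 1) == "P") * 100
}

# Counts from the third row of the Pass data frame
codes <- make_test_vector(p1 = 10, f1 = 0, p2 = 2, f2 = 0)
pass_rate_sketch(codes)            # overall
pass_rate_sketch(codes, test = 1)  # first test only
pass_rate_sketch(codes, test = 2)  # second test only
```

With these counts all three calls should give 100, which is the value expected for the second test in that row.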
Everything was looking great until I got to the third row of the Pass data frame, which contains the pass/fail counts from my quality specialist. Instead of returning a second test pass rate of 100%, it returns NA, but only when I use mutate.
library(dplyr)
Pass <- structure(list(P1 = c(2L, 0L, 10L),
F1 = c(0L, 2L, 0L),
P2 = c(0L, 3L, 2L),
F2 = c(0L, 2L, 0L),
id = 1:3),
.Names = c("P1", "F1", "P2", "F2", "id"),
class = c("tbl_df", "data.frame"),
row.names = c(NA, -3L))
So here's something akin to what I did with mutate.
Pass %>%
group_by(id) %>%
mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
pass_rate1 = P1 / (P1 + F1) * 100,
pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [3 x 8]
Groups: id [3]
P1 F1 P2 F2 id pass_rate pass_rate1 pass_rate2
(int) (int) (int) (int) (int) (dbl) (dbl) (dbl)
1 2 0 0 0 1 100.00000 100 NA
2 0 2 3 2 2 42.85714 0 60
3 10 0 2 0 3 100.00000 100 NA
Compare that with what I get when I use summarise:
Pass %>%
group_by(id) %>%
summarise(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
pass_rate1 = P1 / (P1 + F1) * 100,
pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [3 x 4]
id pass_rate pass_rate1 pass_rate2
(int) (dbl) (dbl) (dbl)
1 1 100.00000 100 NA
2 2 42.85714 0 60
3 3 100.00000 100 100
I would have expected these to return the same results. My guess is that mutate is having problems somewhere because it assumes n rows per group should map to n rows in the result (is it getting confused in calculating n here?), while summarise knows that no matter how many rows it starts with, it's going to end with just 1.
Does anyone have any thoughts on what the mechanics behind this behavior are?
It seems to me like some interference between dplyr and plyr. I had the same issue with another unbalanced dataset (so grouping was necessary), where exactly in the third group the mutated variable was erroneously NA! Then I reproduced your example at home. First, after
library("dplyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")
I got exactly your results. Then I executed my own script, where the package plyr had been loaded. After the warning not to load plyr after dplyr, the NA in my third group was gone and your example was also computed correctly! Here is what I did (I added one more row to see if the NA remains in the third group):
> Pass <- structure(list(P1 = c(2L, 0L, 10L,8L),
+ F1 = c(0L, 2L, 0L, 4L),
+ P2 = c(0L, 3L, 2L, 2L),
+ F2 = c(0L, 2L, 0L, 1L),
+ id = 1:4),
+ .Names = c("P1", "F1", "P2", "F2", "id"),
+ class = c("tbl_df", "data.frame"),
+ row.names = c(NA, -4L))
> Pass %>%
+ group_by(id) %>%
+ mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+ pass_rate1 = P1 / (P1 + F1) * 100,
+ pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]
P1 F1 P2 F2 id pass_rate pass_rate1 pass_rate2
(int) (int) (int) (int) (int) (dbl) (dbl) (dbl)
1 2 0 0 0 1 100.00000 100.00000 NA
2 0 2 3 2 2 42.85714 0.00000 60.00000
3 10 0 2 0 3 100.00000 100.00000 NA
4 8 4 2 1 4 66.66667 66.66667 66.66667
Then I did:
> library("plyr", lib.loc="~/R/x86_64-pc-linux-gnu-library/3.2")
> Pass %>%
+ group_by(id) %>%
+ mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+ pass_rate1 = P1 / (P1 + F1) * 100,
+ pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]
P1 F1 P2 F2 id pass_rate pass_rate1 pass_rate2
(int) (int) (int) (int) (int) (dbl) (dbl) (dbl)
1 2 0 0 0 1 100.00000 100.00000 NaN
2 0 2 3 2 2 42.85714 0.00000 60.00000
3 10 0 2 0 3 100.00000 100.00000 100.00000
4 8 4 2 1 4 66.66667 66.66667 66.66667
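Note that row 1 prints NaN in this output, which is what 0 / 0 evaluates to in R. NaN and NA are easy to confuse because is.na() is TRUE for both. A small base R sketch (not part of the original session) of how to tell them apart:

```r
# Second-test rate for a row with P2 = 0 and F2 = 0
x <- 0L / (0L + 0L) * 100

is.nan(x)         # TRUE: the result of 0 / 0 is NaN
is.na(x)          # TRUE: NaN also counts as missing
is.nan(NA_real_)  # FALSE: a plain NA is not NaN
```

So a NaN in the first row is the arithmetically expected result, while an NA in a row with a non-zero denominator is not.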
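I know that it is not a satisfying answer because plyr should NOT be loaded after dplyr, but maybe it helps those who need to group_by(id). Loading plyr after dplyr masks dplyr's mutate with plyr's version, which ignores the grouping, and that is why the result changes. Alternatively, call plyr::mutate() explicitly. Then you can load dplyr after plyr: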
> Pass %>%
+ group_by(id) %>%
+ plyr::mutate(pass_rate = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
+ pass_rate1 = P1 / (P1 + F1) * 100,
+ pass_rate2 = P2 / (P2 + F2) * 100)
Source: local data frame [4 x 8]
Groups: id [4]
P1 F1 P2 F2 id pass_rate pass_rate1 pass_rate2
(int) (int) (int) (int) (int) (dbl) (dbl) (dbl)
1 2 0 0 0 1 100.00000 100.00000 NaN
2 0 2 3 2 2 42.85714 0.00000 60.00000
3 10 0 2 0 3 100.00000 100.00000 100.00000
4 8 4 2 1 4 66.66667 66.66667 66.66667
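As a cross-check that depends on neither package, the same arithmetic can be run in base R. This is only a sketch: it assumes the four-row counts used above, rebuilt as a plain data.frame named pass_counts (a name introduced here for illustration). Because every id occurs once and the formulas are vectorised, no grouping is needed.

```r
# Same counts as the four-row Pass data frame, as a plain data.frame
pass_counts <- data.frame(
  P1 = c(2L, 0L, 10L, 8L),
  F1 = c(0L, 2L, 0L, 4L),
  P2 = c(0L, 3L, 2L, 2L),
  F2 = c(0L, 2L, 0L, 1L),
  id = 1:4
)

# transform() evaluates each expression column-wise, one value per row
rates <- transform(
  pass_counts,
  pass_rate  = (P1 + P2) / (P1 + P2 + F1 + F2) * 100,
  pass_rate1 = P1 / (P1 + F1) * 100,
  pass_rate2 = P2 / (P2 + F2) * 100
)

rates$pass_rate2[1]  # NaN, because 0 / 0
rates$pass_rate2[3]  # 100, because 2 / (2 + 0) * 100
```

These values agree with the plyr::mutate() output above, which suggests that the NA in the third row comes from the grouped dplyr mutate rather than from the data.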