I am asking this question to understand how the group_by function is supposed to work.
Suppose that I have a dataframe with 5 binary variables (the meaning of these variables isn't important) and one variable id representing some users. For example:
id<- c("A","A" , "B" , "B")
d<- as.data.frame(id)
d$d1<- c(1,0,1,0)
d$d2<- c(1,0,1,0)
d$d3<- c(0,1,1,0)
d$d4<- c(0,1,0,1)
d$d5<- c(0,1,0,0)
> d
id d1 d2 d3 d4 d5
1 A 1 1 0 0 0
2 A 0 0 1 1 1
3 B 1 1 1 0 0
4 B 0 0 0 1 0
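I want a function that checks, for each user (A and B), whether every one of the variables d1 to d5 contains a 1 in at least one row.
verificator <- function(d) {
  # sum each of the columns d1..d5 (columns 2 to 6) and multiply the sums
  r <- prod(apply(d[, 2:6], 2, sum))
  # a product of 0 means at least one column contains no 1
  r <- as.logical(r)
  return(r)
}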
For example, for user A, each of d1 to d5 contains a 1 in at least one row:
verificator(d[1:2,])
[1] TRUE
But for user B, d5 is 0 in both rows, so we get
verificator(d[3:4,])
[1] FALSE
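To make the two results above easier to follow, here is a small base R sketch of what verificator() computes internally for each user. The names sums_A and sums_B are only illustrative, and colSums() is used here in place of apply(..., 2, sum), which gives the same column totals.

```r
# Rebuild the example data frame from the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

# Column sums of d1..d5 (columns 2 to 6) for each user's rows.
sums_A <- colSums(d[1:2, 2:6])
sums_B <- colSums(d[3:4, 2:6])

sums_A  # every column sums to 1
sums_B  # d5 sums to 0

# The product is 0 as soon as a single column sum is 0.
prod(sums_A)  # 1, so as.logical() gives TRUE
prod(sums_B)  # 0, so as.logical() gives FALSE
```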
But when I use dplyr to apply the function to the data frame d by group, something goes wrong:
d2<- d %>% group_by(id) %>% summarise(one = verificator(.))
d2
Source: local data frame [2 x 2]
id one
1 A TRUE
2 B TRUE
Why does this return TRUE for the B user?
To get the expected output, one option is
d %>%
  group_by(id) %>%
  summarise_each(funs(sum)) %>%
  rowwise() %>%
  do(data.frame(id = .[1L], one = as.logical(prod(unlist(.[-1])))))
# id one
# <fctr> <lgl>
#1 A TRUE
#2 B FALSE
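Another possibility, sketched here under the assumption that do() is available in the installed dplyr (it is marked as superseded in recent versions but, as far as I know, still works), is to keep the original verificator() unchanged. Inside do(), the dot refers to the rows of the current group only, so the function receives one user at a time. The name res is only illustrative.

```r
library(dplyr)

# Rebuild the example data frame from the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

# The function from the question, unchanged in behaviour.
verificator <- function(d) {
  as.logical(prod(apply(d[, 2:6], 2, sum)))
}

# Inside do(), `.` is the data frame of the current group (id column included),
# so d[, 2:6] still selects d1..d5.
res <- d %>%
  group_by(id) %>%
  do(data.frame(one = verificator(.)))

res
# expected: one is TRUE for A and FALSE for B
```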
We can also do this using by from base R
verificator <- function(x){
  as.logical(prod(colSums(x)))
}
c(by(d[-1], d$id, FUN = verificator))
# A B
#TRUE FALSE
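A close variant of the by() approach, given only as a sketch, uses split() and sapply() and states the condition directly as "every column sum is greater than zero" instead of going through a product. The name res is illustrative.

```r
# Rebuild the example data frame from the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

# Split the value columns by user, then test each piece separately.
res <- sapply(
  split(d[-1], d$id),
  function(g) all(colSums(g) > 0)  # TRUE only if every column has a 1
)

res
#     A     B
#  TRUE FALSE
```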
The reason you get the wrong result is that, when using %>%, the dot (.) stands for the complete result of the expression on the left of %>%, not for the rows of the current group. Therefore, you are simply evaluating your verificator() twice on the complete data frame d.
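A quick way to check this behaviour of the dot, as a minimal sketch: count rows once through the dot and once through dplyr's n(), which is evaluated per group. The column names rows_in_dot and rows_in_group are only illustrative.

```r
library(dplyr)

# Rebuild the example data frame from the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

res <- d %>%
  group_by(id) %>%
  summarise(
    rows_in_dot   = nrow(.),  # `.` is the whole grouped data frame: 4 rows
    rows_in_group = n()       # n() counts the rows of the current group: 2
  )

res
```

Both users show 4 in rows_in_dot and 2 in rows_in_group, which is the same effect that makes verificator(.) see all four rows for each user.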
You can see this as follows. First, I check that verificator() applied to the complete data frame indeed returns TRUE:
verificator(d)
## [1] TRUE
Then, I define another variant of verificator() that prints its argument:
verificator_p <- function(d) {
  print(d)
  return(verificator(d))
}
Running the code that you proposed with this variant shows that it is always the full data frame that is passed to the function:
d %>% group_by(id) %>% summarise(one = verificator_p(.))
## Source: local data frame [4 x 6]
## Groups: id [2]
##
## id d1 d2 d3 d4 d5
## (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 A 1 1 0 0 0
## 2 A 0 0 1 1 1
## 3 B 1 1 1 0 0
## 4 B 0 0 0 1 0
## Source: local data frame [4 x 6]
## Groups: id [2]
##
## id d1 d2 d3 d4 d5
## (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 A 1 1 0 0 0
## 2 A 0 0 1 1 1
## 3 B 1 1 1 0 0
## 4 B 0 0 0 1 0
## Source: local data frame [4 x 6]
## Groups: id [2]
##
## id d1 d2 d3 d4 d5
## (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1 A 1 1 0 0 0
## 2 A 0 0 1 1 1
## 3 B 1 1 1 0 0
## 4 B 0 0 0 1 0
## Source: local data frame [2 x 2]
##
## id one
## (fctr) (lgl)
## 1 A TRUE
## 2 B TRUE
What I admittedly don't know is why d is printed three times and not just twice.