
In R, how does group_by in dplyr work?

Tags:

r

dplyr

I am asking this question to understand how the group_by function actually works.

Suppose that I have a dataframe with 5 binary variables (the meaning of these variables isn't important) and one variable id representing some users. For example:

id<- c("A","A" , "B" , "B")
d<- as.data.frame(id) 
d$d1<- c(1,0,1,0)
d$d2<- c(1,0,1,0)
d$d3<- c(0,1,1,0)
d$d4<- c(0,1,0,1)
d$d5<- c(0,1,0,0)
> d
  id d1 d2 d3 d4 d5
1  A  1  1  0  0  0
2  A  0  0  1  1  1
3  B  1  1  1  0  0
4  B  0  0  0  1  0

I want a function that checks, for each user (A and B), whether every variable d1 to d5 contains at least one 1.

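verificator <- function(d) {
  # Sum each of columns 2 to 6 (d1..d5); the product is non-zero
  # only if every column contains at least one 1.
  r <- prod(apply(d[, 2:6], 2, sum))
  as.logical(r)
}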

For example, for user A, each of d1 to d5 contains a 1 in at least one row:

verificator(d[1:2,]) 
[1] TRUE

But for user B, d5 is never 1, so we have

verificator(d[3:4,])
[1] FALSE

But when I use dplyr to apply the function per group to the data frame d, something goes wrong:

d2<- d %>% group_by(id) %>% summarise(one = verificator(.))
d2
Source: local data frame [2 x 2]

  id  one
1  A TRUE
2  B TRUE

Why does this return TRUE for the B user?

Vasco asked Mar 11 '23 21:03


2 Answers

To get the expected output, one option is:

d %>% 
    group_by(id) %>% 
    summarise_each(funs(sum)) %>% rowwise()  %>% 
    do(data.frame(id = .[1L], one = as.logical(prod(unlist(.[-1])))))
#     id   one
#  <fctr> <lgl>
#1      A  TRUE
#2      B FALSE
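
Note that summarise_each() and funs() are deprecated in recent dplyr releases. A minimal sketch of the same idea, assuming dplyr 1.0.0 or later (where across() is available), could look like this. The names sums and result are only illustrative, and the data frame is rebuilt here so the snippet runs on its own:

```r
library(dplyr)

# Same example data as in the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

# Per-group column sums: one row per id, one column per d1..d5.
sums <- d %>%
  group_by(id) %>%
  summarise(across(d1:d5, sum))

# A group passes when the product of its column sums is non-zero,
# i.e. every column contains at least one 1.
result <- data.frame(
  id  = sums$id,
  one = as.logical(apply(as.matrix(sums[-1]), 1, prod))
)
result
```

The per-group sums are computed by dplyr, and the row-wise product is done in base R so that no dot placeholder is needed at all.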

We can also do this using by() from base R:

verificator <- function(x){
     as.logical(prod(colSums(x)))
    }
c(by(d[-1], d$id, FUN = verificator))
#   A     B 
#TRUE FALSE 

akrun answered Mar 23 '23 11:03


The reason you get a wrong result is that, when using %>%, the dot (.) stands for the complete result of the expression on the left of %>%. It does not stand for the current group. Therefore, you are simply evaluating your verificator() once per group, each time on the complete data frame d.

You can see this as follows. First, I check that verificator() applied to the complete data frame indeed returns TRUE:

verificator(d)
## [1] TRUE

Then, I define another variant of verificator() that prints its argument:

verificator_p <- function(d) {
  print(d)
  return(verificator(d))
}

Running the code that you proposed shows that it is always the full data frame that is passed to the function:

d %>% group_by(id) %>% summarise(one = verificator_p(.))
## Source: local data frame [4 x 6]
## Groups: id [2]
## 
##       id    d1    d2    d3    d4    d5
##   (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1      A     1     1     0     0     0
## 2      A     0     0     1     1     1
## 3      B     1     1     1     0     0
## 4      B     0     0     0     1     0
## Source: local data frame [4 x 6]
## Groups: id [2]
## 
##       id    d1    d2    d3    d4    d5
##   (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1      A     1     1     0     0     0
## 2      A     0     0     1     1     1
## 3      B     1     1     1     0     0
## 4      B     0     0     0     1     0
## Source: local data frame [4 x 6]
## Groups: id [2]
## 
##       id    d1    d2    d3    d4    d5
##   (fctr) (dbl) (dbl) (dbl) (dbl) (dbl)
## 1      A     1     1     0     0     0
## 2      A     0     0     1     1     1
## 3      B     1     1     1     0     0
## 4      B     0     0     0     1     0
## Source: local data frame [2 x 2]
## 
##       id   one
##   (fctr) (lgl)
## 1      A  TRUE
## 2      B  TRUE

What I admittedly don't know is why d is printed three times and not just twice...
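
A more compact way to see the same thing, and one possible way around it, is sketched below. It assumes a dplyr version in which do() is still available (it is marked as superseded in dplyr 1.0.0 but not removed); sizes and per_group are just illustrative names. Inside do(), unlike inside summarise(), the dot refers to the rows of the current group:

```r
library(dplyr)

# Same example data and function as in the question.
d <- data.frame(
  id = c("A", "A", "B", "B"),
  d1 = c(1, 0, 1, 0),
  d2 = c(1, 0, 1, 0),
  d3 = c(0, 1, 1, 0),
  d4 = c(0, 1, 0, 1),
  d5 = c(0, 1, 0, 0)
)

verificator <- function(d) {
  as.logical(prod(apply(d[, 2:6], 2, sum)))
}

# Inside summarise(), `.` is still the whole piped data frame,
# while n() reports the size of the current group.
sizes <- d %>%
  group_by(id) %>%
  summarise(rows_in_dot = nrow(.), rows_in_group = n())
sizes

# Inside do(), `.` is the data of the current group only,
# so verificator() sees the rows of one user at a time.
per_group <- d %>%
  group_by(id) %>%
  do(data.frame(one = verificator(.)))
per_group
```

Here rows_in_dot is the row count of the full data frame for every group, whereas rows_in_group is the actual group size, which is the mismatch that makes the original summarise() call return TRUE for B.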

Stibu answered Mar 23 '23 11:03