Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr summarise evaluates custom function twice? [duplicate]

Tags:

r

dplyr

It seems that summarise and summarise_each are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following

X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )

which looks like this:

   Group Var1 Var2
 1    G1    1   11
 2    G1    2   12
 3    G2    3   13
 4    G2    4   14
 5    G2    5   15

Further suppose that we have a (potentially expensive) function

f <- function(v)
{
   cat( "Calling f with vector", v, "\n" )
   ## ...additional bookkeeping and processing...
   mean(v)
}

that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:

X %>% group_by( Group ) %>% summarise_each( funs(f) )

However, the output shows that f was called one additional time for each variable in G1:

Calling f with vector 1 2 
Calling f with vector 1 2 
Calling f with vector 3 4 5 
Calling f with vector 11 12 
Calling f with vector 11 12 
Calling f with vector 13 14 15 
# A tibble: 2 x 3
   Group  Var1  Var2
  <fctr> <dbl> <dbl> 
1     G1   1.5  11.5
2     G2   4.0  14.0

The same issue is present when using summarize:

> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
   Group  test
  <fctr> <dbl>
1     G1   1.5
2     G2   4.0

Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?

(This is using R version 3.3.0 and dplyr version 0.5.0)

EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)

like image 821
Artem Sokolov Avatar asked Nov 19 '22 03:11

Artem Sokolov


1 Answers

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repro. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I checkout and build the version at the previous commit, but not when building from that one or subsequent ones.

(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)

In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:

  CLASS* obj = static_cast<CLASS*>(this);
  typename Data::group_iterator git = gdf.group_begin();

  RObject first_result = obj->process_chunk(*git);
  ++git; // This line was added

and

  for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
    RObject chunk = obj->process_chunk(*git);

[Comments added by me, not part of the actual source]

like image 146
Tim Goodman Avatar answered Jun 17 '23 01:06

Tim Goodman