dplyr summarise() and summarise_each() make extra calls to the provided functions

It seems that summarise and summarise_each make unnecessary extra calls to the callback functions they are given. Suppose that we have the following data frame:

X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )

which looks like this:

   Group Var1 Var2
 1    G1    1   11
 2    G1    2   12
 3    G2    3   13
 4    G2    4   14
 5    G2    5   15

Further suppose that we have a (potentially expensive) function

f <- function(v)
{
   cat( "Calling f with vector", v, "\n" )
   ## ...additional bookkeeping and processing...
   mean(v)
}

that we would like to apply to each of our variables in each group. Using dplyr, we might go about it in the following way:

X %>% group_by( Group ) %>% summarise_each( funs(f) )

However, the output shows that f was called one additional time for each variable in G1:

Calling f with vector 1 2 
Calling f with vector 1 2 
Calling f with vector 3 4 5 
Calling f with vector 11 12 
Calling f with vector 11 12 
Calling f with vector 13 14 15 
# A tibble: 2 x 3
   Group  Var1  Var2
  <fctr> <dbl> <dbl> 
1     G1   1.5  11.5
2     G2   4.0  14.0

The same issue is present when using summarise:

> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
   Group  test
  <fctr> <dbl>
1     G1   1.5
2     G2   4.0

Why is this happening and how would one go about preventing summarise and summarise_each from making those extra calls?

(This is using R version 3.3.0 and dplyr version 0.5.0)

EDIT: It appears that the issue has to do with the interplay between group_by and summarise/summarise_each. Without the grouping, no extra calls are made. Also, mutate and mutate_each do not suffer from this issue. (Credit: eddi and eipi10 for these findings)

asked Aug 30 '16 20:08

Artem Sokolov

1 Answer

Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repo. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I check out and build the version at the previous commit, but not when building from that commit or any subsequent one.

(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)

In particular, in the function SEXP process_data(const Data& gdf) in inst/include/dplyr/Result/CallbackProcessor.h, note these changes:

  CLASS* obj = static_cast<CLASS*>(this);
  typename Data::group_iterator git = gdf.group_begin();

  RObject first_result = obj->process_chunk(*git);
  ++git; // This line was added

and

  for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
    RObject chunk = obj->process_chunk(*git);

[Comments added by me, not part of the actual source]
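To make the effect of those two changes concrete, here is a minimal standalone sketch of the same loop shape. It is not dplyr's code: Group, Callback, mean_of, process_before_fix and process_after_fix are hypothetical names I made up, a "group" is just a vector of doubles, and I'm assuming the fixed version reuses the up-front result for the first group (the excerpt above doesn't show what happens to first_result).

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical stand-ins, not dplyr types: a group is a plain vector and the
// callback plays the role of the user's function f.
using Group = std::vector<double>;
using Callback = std::function<double(const Group&)>;

// Arithmetic mean of a non-empty group.
double mean_of(const Group& g) {
    assert(!g.empty());
    return std::accumulate(g.begin(), g.end(), 0.0) / static_cast<double>(g.size());
}

// Pre-fix shape: the first group is evaluated up front, but the iterator is
// not advanced and the loop starts at i = 0, so group 0 is evaluated twice.
std::vector<double> process_before_fix(const std::vector<Group>& groups,
                                       const Callback& f) {
    std::vector<double> out;
    if (groups.empty()) return out;
    auto git = groups.begin();
    double first_result = f(*git);  // first call on group 0
    (void)first_result;             // discarded in this sketch
    for (std::size_t i = 0; i < groups.size(); ++git, ++i) {
        out.push_back(f(*git));     // i == 0 calls f on group 0 again
    }
    return out;
}

// Post-fix shape: advance past the first group and start the loop at i = 1.
// Assumption: the up-front result is kept as the first group's result.
std::vector<double> process_after_fix(const std::vector<Group>& groups,
                                      const Callback& f) {
    std::vector<double> out;
    if (groups.empty()) return out;
    auto git = groups.begin();
    out.push_back(f(*git));         // the only call on group 0
    ++git;                          // mirrors the added "++git" line
    for (std::size_t i = 1; i < groups.size(); ++git, ++i) {
        out.push_back(f(*git));
    }
    return out;
}
```

With the question's Var1 groups {1, 2} and {3, 4, 5} and a callback that counts its own invocations, the first function should call it three times and the second twice, with both returning the same means. That matches the duplicated "Calling f with vector 1 2" line in the question.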

answered Nov 13 '22 13:11

Tim Goodman
