It seems that summarise
and summarise_each
are making unnecessary extra calls to the callback functions they are provided with. Suppose that we have the following
X <- data.frame( Group = rep(c("G1","G2"),2:3), Var1 = 1:5, Var2 = 11:15 )
which looks like this:
Group Var1 Var2
1 G1 1 11
2 G1 2 12
3 G2 3 13
4 G2 4 14
5 G2 5 15
Further suppose that we have a (potentially expensive) function
f <- function(v)
{
cat( "Calling f with vector", v, "\n" )
## ...additional bookkeeping and processing...
mean(v)
}
that we would like to apply to each of our variables in each group. Using dplyr
, we might go about it in the following way:
X %>% group_by( Group ) %>% summarise_each( funs(f) )
However, the output shows that f
was called one additional time for each variable in G1:
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
Calling f with vector 11 12
Calling f with vector 11 12
Calling f with vector 13 14 15
# A tibble: 2 x 3
Group Var1 Var2
<fctr> <dbl> <dbl>
1 G1 1.5 11.5
2 G2 4.0 14.0
The same issue is present when using summarize
:
> X %>% group_by( Group ) %>% summarise( test = f(Var1) )
Calling f with vector 1 2
Calling f with vector 1 2
Calling f with vector 3 4 5
# A tibble: 2 × 2
Group test
<fctr> <dbl>
1 G1 1.5
2 G2 4.0
Why is this happening and how would one go about preventing summarise
and summarise_each
from making those extra calls?
(This is using R
version 3.3.0 and dplyr
version 0.5.0)
EDIT: It appears that the issue has to do with the interplay between group_by
and summarise
/summarise_each
. Without the grouping, no extra calls are made. Also, mutate
and mutate_each
do not suffer from this issue. (Credit: eddi and eipi10 for these findings)
Summarize Function in R Programming. As its name implies, the summarize function reduces a data frame to a summary of just one vector or value. Many times, these summaries are calculated by grouping observations using a factor or categorical variables first.
summarise() creates a new data frame. It will have one (or more) rows for each combination of grouping variables; if there are no grouping variables, the output will have a single row summarising all observations in the input.
n() gives the current group size. cur_data() gives the current data for the current group (excluding grouping variables).
Although this issue is still present in dplyr 0.5.0 (published 2016-06-24), it is fixed in the dplyr GitHub repro. It was fixed with this commit made on 2016-09-24. I've confirmed that I can reproduce the issue when I checkout and build the version at the previous commit, but not when building from that one or subsequent ones.
(And yes, I tried a whole bunch of other ones before I found it. Why I go to such lengths in hope of earning imaginary internet points, I leave as a question for my therapist. :)
In particular, in the function SEXP process_data(const Data& gdf)
in inst/include/dplyr/Result/CallbackProcessor.h
, note these changes:
CLASS* obj = static_cast<CLASS*>(this);
typename Data::group_iterator git = gdf.group_begin();
RObject first_result = obj->process_chunk(*git);
++git; // This line was added
and
for (int i = 1; i < ngroups; ++git, ++i) { // changed from starting at i = 0
RObject chunk = obj->process_chunk(*git);
[Comments added by me, not part of the actual source]
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With