Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is using dplyr pipe (%>%) slower than an equivalent non-pipe expression, for high-cardinality group-by?

Tags:

I thought that generally speaking using %>% wouldn't have a noticeable effect on speed. But in this case it runs 4x slower.

library(dplyr)
library(microbenchmark)

set.seed(0)
dummy_data <- dplyr::data_frame(
  id=floor(runif(10000, 1, 10000))
  , label=floor(runif(10000, 1, 4))
)

microbenchmark(dummy_data %>% group_by(id) %>% summarise(list(unique(label))))
microbenchmark(dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list))

Without pipe:

min       lq     mean   median       uq      max neval
1.691441 1.739436 1.841157 1.812778 1.880713 2.495853   100

With pipe:

min       lq     mean   median       uq      max neval
6.753999 6.969573 7.167802 7.052744 7.195204 8.833322   100

Why is %>% so much slower in this situation? Is there a better way to write this?

EDIT:

I made the data frame smaller and incorporated Moody_Mudskipper's suggestions into the benchmarking.

microbenchmark(
  nopipe=dummy_data %>% group_by(id) %>% summarise(list(unique(label))),
  magrittr=dummy_data %>% group_by(id) %>% summarise(label %>% unique %>% list),
  magrittr2=dummy_data %>% group_by(id) %>% summarise_at('label', . %>% unique %>% list),
  fastpipe=dummy_data %.% group_by(., id) %.% summarise(., label %.% unique(.) %.% list(.))
)

Unit: milliseconds
      expr       min        lq      mean    median        uq      max neval
    nopipe  59.91252  70.26554  78.10511  72.79398  79.29025 214.9245   100
  magrittr 469.09573 525.80084 568.28918 558.05634 590.48409 767.4647   100
 magrittr2  84.06716  95.20952 106.28494 100.32370 110.92373 241.1296   100
  fastpipe  93.57549 103.36926 109.94614 107.55218 111.90049 162.7763   100
like image 522
logworthy Avatar asked Mar 11 '16 05:03

logworthy


1 Answers

What might be a negligible effect in a real-world full application becomes non-negligible when writing one-liners that are time-dependent on the formerly "negligible". I suspect if you profile your tests then most of the time will be in the summarize clause, so lets microbenchmark something similar to that:

> set.seed(99);z=sample(10000,4,TRUE)
> microbenchmark(z %>% unique %>% list, list(unique(z)))
Unit: microseconds
                  expr     min      lq      mean   median      uq     max neval
 z %>% unique %>% list 142.617 144.433 148.06515 145.0265 145.969 297.735   100
       list(unique(z))   9.289   9.988  10.85705  10.5820  11.804  12.642   100

This is doing something a bit different to your code but illustrates the point. Pipes are slower.

Because pipes need to restructure R's calling into the same one that function evaluations are using, and then evaluate them. So it has to be slower. By how much depends on how speedy the functions are. Calls to unique and list are pretty fast in R, so the whole difference here is the pipe overhead.

Profiling expressions like this showed me most of the time is spent in the pipe functions:

                         total.time total.pct self.time self.pct
"microbenchmark"              16.84     98.71      1.22     7.15
"%>%"                         15.50     90.86      1.22     7.15
"eval"                         5.72     33.53      1.18     6.92
"split_chain"                  5.60     32.83      1.92    11.25
"lapply"                       5.00     29.31      0.62     3.63
"FUN"                          4.30     25.21      0.24     1.41
 ..... stuff .....

then somewhere down in about 15th place the real work gets done:

"as.list"                      1.40      8.13      0.66     3.83
"unique"                       1.38      8.01      0.88     5.11
"rev"                          1.26      7.32      0.90     5.23

Whereas if you just call the functions as Chambers intended, R gets straight down to it:

                         total.time total.pct self.time self.pct
"microbenchmark"               2.30     96.64      1.04    43.70
"unique"                       1.12     47.06      0.38    15.97
"unique.default"               0.74     31.09      0.64    26.89
"is.factor"                    0.10      4.20      0.10     4.20

Hence the oft-quoted recommendation that pipes are okay on the command line where your brain thinks in chains, but not in functions that might be time-critical. In practice this overhead will probably get wiped out in one call to glm with a few hundred data points, but that's another story....

like image 180
Spacedman Avatar answered Sep 21 '22 21:09

Spacedman