
Why is dplyr so slow?

Like most people, I'm impressed by Hadley Wickham and what he's done for R, so I figured I'd move some of my functions over to his tidyverse. Having done so, I'm left wondering what the point of it all is.

My new dplyr functions are much slower than their base equivalents, so I hope I'm doing something wrong. I'd particularly like some payoff for the effort required to understand non-standard evaluation.

So, what am I doing wrong? Why is dplyr so slow?

An example:

require(microbenchmark)
require(dplyr)

df <- tibble(
             a = 1:10,
             b = c(1:5, 4:0),
             c = 10:1)

addSpread_base <- function() {
    df[['spread']] <- df[['a']] - df[['b']]
    df
}

addSpread_dplyr <- function() df %>% mutate(spread := a - b)

all.equal(addSpread_base(), addSpread_dplyr())

microbenchmark(addSpread_base(), addSpread_dplyr(), times = 1e4)

Timing results:

Unit: microseconds
              expr     min      lq      mean median      uq       max neval
  addSpread_base()  12.058  15.769  22.07805  24.58  26.435  2003.481 10000
 addSpread_dplyr() 607.537 624.697 666.08964 631.19 636.291 41143.691 10000

So using dplyr functions to transform the data takes about 30x longer -- surely this isn't the intention?

I figured that perhaps this is too easy a case, and that dplyr would really shine in a more realistic one where we add a column and subset the data, but this was worse. As the timings below show, the dplyr approach is ~70x slower than the base one.

# mutate and substitute
addSpreadSub_base <- function(df, col1, col2) {
    df[['spread']] <- df[['a']] - df[['b']]
    df[, c(col1, col2, 'spread')]
}

addSpreadSub_dplyr <- function(df, col1, col2) {
    var1 <- as.name(col1)
    var2 <- as.name(col2)
    qq <- quo(!!var1 - !!var2)
    df %>% 
        mutate(spread := !!qq) %>% 
        select(!!var1, !!var2, spread)
}

all.equal(addSpreadSub_base(df, col1 = 'a', col2 = 'b'), 
          addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'))

microbenchmark(addSpreadSub_base(df, col1 = 'a', col2 = 'b'), 
               addSpreadSub_dplyr(df, col1 = 'a', col2 = 'b'), 
               times = 1e4)

Results:

Unit: microseconds
                                           expr      min       lq      mean   median       uq      max neval
  addSpreadSub_base(df, col1 = "a", col2 = "b")   22.725   30.610   44.3874   45.450   53.798  2024.35 10000
 addSpreadSub_dplyr(df, col1 = "a", col2 = "b") 2748.757 2837.337 3011.1982 2859.598 2904.583 44207.81 10000
asked Jan 23 '19 by ricardo

1 Answer

These are microseconds, and your dataset has only 10 rows. Unless you plan to loop over millions of 10-row datasets, your benchmark is pretty much irrelevant (and in that case I can't imagine a situation where it wouldn't be wise to bind them together as a first step).
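To make that concrete, here is a minimal sketch of what "bind them together as a first step" looks like; it is not from the original answer, and the 1,000-frame list and the frame id column are made up for illustration. The point is that you pay mutate()'s call overhead once instead of once per tiny data frame.

library(dplyr)

# 1,000 tiny 10-row data frames (illustrative)
small_dfs <- replicate(1000, tibble(a = 1:10, b = c(1:5, 4:0)), simplify = FALSE)

# Looping: pays the mutate() call overhead 1,000 times
looped <- lapply(small_dfs, function(d) mutate(d, spread = a - b))

# Binding first: pays it once
bound <- bind_rows(small_dfs, .id = "frame") %>% mutate(spread = a - b)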

Let's do it with a bigger dataset, say 1 million times bigger:

df <- tibble(
  a = 1:10,
  b = c(1:5, 4:0),
  c = 10:1)

df2 <- bind_rows(replicate(1e6, df, simplify = FALSE))  # 10 million rows

addSpread_base <- function(df) {
  df[['spread']] <- df[['a']] - df[['b']]
  df
}
addSpread_dplyr  <- function(df) df %>% mutate(spread = a - b)

microbenchmark::microbenchmark(
  addSpread_base(df2), 
  addSpread_dplyr(df2),
  times = 100)
# Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval cld
# addSpread_base(df2) 25.85584 26.93562 37.77010 32.33633 35.67604 170.6507   100   a
# addSpread_dplyr(df2) 26.91690 27.57090 38.98758 33.39769 39.79501 182.2847   100   a

Still quite fast and not much difference.

As for the "why" of the result you got: mutate() does far more work per call than a base assignment, so each call carries a fixed overhead.
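One way to see that this is a roughly fixed per-call overhead rather than a multiplicative slowdown is to benchmark the same transformation at different sizes. This is a rough sketch, not part of the original answer, and the row counts are arbitrary:

library(dplyr)
library(microbenchmark)

time_both <- function(n) {
  d <- tibble(a = seq_len(n), b = rev(seq_len(n)))
  microbenchmark(
    base  = { d[["spread"]] <- d[["a"]] - d[["b"]]; d },
    dplyr = mutate(d, spread = a - b),
    times = 100
  )
}

time_both(10)    # dplyr's fixed overhead dominates on 10 rows
time_both(1e6)   # on 1e6 rows the overhead is negligible next to the arithmetic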

Commenters have pointed out that dplyr doesn't try too hard to be fast, and maybe that's true when you compare it to data.table, since the interface is the first concern, but the authors have been working hard on speed as well. Hybrid evaluation, for example, allows (if I understand it correctly) C code to be executed directly on grouped data when aggregating with common functions, which can be much faster than base code, but simple code will always run faster with simple functions.
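If I understand hybrid evaluation correctly, a sketch like the one below is where you would expect it to pay off: a bare mean() can be recognised and computed in C across groups, whereas wrapping it in your own function forces ordinary R evaluation group by group. Whether you actually see a gap depends on your dplyr version, and slow_mean() plus the data here are made up for illustration.

library(dplyr)
library(microbenchmark)

set.seed(1)
d <- tibble(g = sample(1e4, 1e6, replace = TRUE), x = rnorm(1e6))

slow_mean <- function(v) mean(v)  # hypothetical wrapper that defeats hybrid evaluation

microbenchmark(
  hybrid = d %>% group_by(g) %>% summarise(m = mean(x)),
  plain  = d %>% group_by(g) %>% summarise(m = slow_mean(x)),
  times = 10
)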

answered by Moody_Mudskipper