I've got a list (length = 5000) of tibbles that I want to merge. They all have the same columns, so I thought of merging them with dplyr::bind_rows. On the face of it, binding each additional tibble is very quick; however, the execution time grows superlinearly rather than linearly as more tibbles are added.
Having done some googling, this looks very much like the bug reported here: https://github.com/tidyverse/dplyr/issues/1396. Even though that bug is supposed to have been fixed in the bind_rows internals, I'm still seeing the elapsed time per added tibble blow up.
library(foreach)
library(tidyverse)
set.seed(123456)
tibbles <- foreach(i = 1:200) %do% {
  tibble(a = rnorm(10000),
         b = rep(letters[1:25], 400),
         c = rnorm(10000))
}
times <- foreach(i = 1:200) %do% {
  system.time(tibbles[1:i] %>%
                purrr::reduce(bind_rows))
}
times %>%
  map_dbl(.f = ~.x[3]) %>%
  plot(ylab = "time [s] per added tibble")
Any ideas why this is the case and how to solve it?
Thanks.
My guess is that every time you call bind_rows (or rbind), R has to allocate a new set of columns and copy all of the accumulated data over again, so each step of the reduce gets more expensive than the last. This leads to a quadratic, not exponential, increase in total time.
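You can see the mechanism directly by timing every individual bind_rows call inside the reduce chain. A rough sketch, reusing the tibbles list from your example (timed_bind and step_times are just names made up for the illustration):
# Sketch: record the elapsed time of each bind_rows call made by reduce().
# The per-step cost grows with the size of the accumulator, so the total is quadratic.
step_times <- numeric(0)
timed_bind <- function(acc, x) {
  t <- system.time(out <- bind_rows(acc, x))["elapsed"]
  step_times <<- c(step_times, t)
  out
}
invisible(purrr::reduce(tibbles, timed_bind))
plot(step_times, ylab = "time [s] of the i-th bind_rows call")
The plot should climb roughly linearly with i, and summing a linearly growing per-step cost gives the quadratic overall behaviour you measured.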
Try pre-allocating the columns instead, so every value is copied exactly once into storage that is allocated up front:
system.time({
  n <- vapply(tibbles, nrow, 0)
  ntot <- sum(n)
  # Allocate the full-length columns once, then fill them in place.
  cols <- list(a = numeric(ntot), b = character(ntot), c = numeric(ntot))
  off <- 0
  for (i in seq_along(tibbles)) {
    ix <- off + seq_len(n[[i]])
    for (j in seq_along(cols)) {
      cols[[j]][ix] <- tibbles[[i]][[j]]
    }
    off <- off + n[[i]]
  }
  result <- as_tibble(cols)
})
#> user system elapsed
#> 0.073 0.012 0.085
Compare with the purrr::reduce approach:
system.time(tibbles[1:200] %>% purrr::reduce(bind_rows))
#> user system elapsed
#> 4.888 2.013 6.928
Although, as aosmith notes, in your situation it's better to just use bind_rows:
system.time(result <- bind_rows(tibbles))
#> user system elapsed
#> 0.039 0.005 0.044
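If you want to double-check that the hand-rolled pre-allocation gives the same result, here's a quick sanity check as a sketch, reusing the cols list left over from the pre-allocation block (all.equal is base R):
# Sanity check (sketch): the pre-allocated columns and bind_rows() should agree.
preallocated <- as_tibble(cols)   # rebuilt from the filled columns above
all.equal(preallocated, bind_rows(tibbles))
# expected: TRUE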