I've got a list (length = 5000) of tibbles that I want to merge. They all have the same columns, so I thought of merging them with dplyr::bind_rows. On the face of it, binding each additional tibble is very quick; however, the execution time grows superlinearly rather than linearly as more tibbles are added.
Having done some googling, this looks very much like the bug reported here: https://github.com/tidyverse/dplyr/issues/1396. Even though that bug is supposed to have been fixed in the bind_rows internals, I'm still seeing the elapsed time per added tibble blow up.
library(foreach)
library(tidyverse)
set.seed(123456)
tibbles <- foreach(i = 1:200) %do% {
  tibble(a = rnorm(10000),
         b = rep(letters[1:25], 400),
         c = rnorm(10000))
}
times <- foreach(i = 1:200) %do% {
  system.time(tibbles[1:i] %>%
                purrr::reduce(bind_rows))
}
times %>%
  map_dbl(.f = ~.x[3]) %>%
  plot(ylab = "time [s] per added tibble")
Any ideas why this is the case and how to solve it?
Thanks.
My guess is that every time you call bind_rows (or rbind), R has to allocate a new set of columns and copy all of the accumulated data over again, so each step of the reduce gets more expensive than the last. This leads to a quadratic, not exponential, increase in total time.
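You can see the mechanism directly by timing every individual bind_rows call inside the reduce chain. A rough sketch, reusing the tibbles list from your example (timed_bind and step_times are just names made up for the illustration):
# Sketch: record the elapsed time of each bind_rows call made by reduce().
# The per-step cost grows with the size of the accumulator, so the total is quadratic.
step_times <- numeric(0)
timed_bind <- function(acc, x) {
  t <- system.time(out <- bind_rows(acc, x))["elapsed"]
  step_times <<- c(step_times, t)
  out
}
invisible(purrr::reduce(tibbles, timed_bind))
plot(step_times, ylab = "time [s] of the i-th bind_rows call")
The plot should climb roughly linearly with i, and summing a linearly growing per-step cost gives the quadratic overall behaviour you measured.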
Try pre-allocating the columns instead, so every value is copied exactly once into storage that is allocated up front:
system.time({
  n <- vapply(tibbles, nrow, 0)
  ntot <- sum(n)
  # Allocate the full-length columns once, then fill them in place.
  cols <- list(a = numeric(ntot), b = character(ntot), c = numeric(ntot))
  off <- 0
  for (i in seq_along(tibbles)) {
    ix <- off + seq_len(n[[i]])
    for (j in seq_along(cols)) {
      cols[[j]][ix] <- tibbles[[i]][[j]]
    }
    off <- off + n[[i]]
  }
  result <- as_tibble(cols)
})
#> user system elapsed
#> 0.073 0.012 0.085
Compare with the purrr::reduce approach:
system.time(tibbles[1:200] %>% purrr::reduce(bind_rows))
#> user system elapsed
#> 4.888 2.013 6.928
Although, as aosmith notes, in your situation it's better to just use bind_rows:
system.time(result <- bind_rows(tibbles))
#> user system elapsed
#> 0.039 0.005 0.044
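If you want to double-check that the hand-rolled pre-allocation gives the same result, here's a quick sanity check as a sketch, reusing the cols list left over from the pre-allocation block (all.equal is base R):
# Sanity check (sketch): the pre-allocated columns and bind_rows() should agree.
preallocated <- as_tibble(cols)   # rebuilt from the filled columns above
all.equal(preallocated, bind_rows(tibbles))
# expected: TRUE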