Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

dplyr bind_rows execution time exponential

Tags:

r

dplyr

purrr

I've got a list (length = 5000) of tibbles that I want to merge. They've all got the same columns so I thought of merging using dplyr::bind_rows. On the face of it binding rows per each added tibble is very quick, however the execution time increases exponentially instead of linearly as more tibbles are being added. Having done some googling, it's very much like the bug observed here: https://github.com/tidyverse/dplyr/issues/1396. Even though the bug is supposed to have been fixed in the bind_rows internals, I'm still seeing an exponential increase in elapsed time per tibble.

library(foreach)
library(tidyverse)
set.seed(123456)
tibbles <- foreach(i = 1:200) %do% {
              tibble(a = rnorm(10000), 
                     b = rep(letters[1:25], 400), 
                     c = rnorm(10000))
}
times <- foreach(i = 1:200) %do% {
            system.time(tibbles[1:i] %>% 
                            purrr::reduce(bind_rows))
}

times %>% 
     map_dbl(.f = ~.x[3]) %>% 
     plot(ylab = "time [s] per added tibble")

Any ideas why this is the case and how to solve it?

Thanks.

elapsed_time

like image 237
biomiha Avatar asked Oct 18 '22 02:10

biomiha


1 Answers

My guess is that every time you call rbind, R has to allocate a new set of columns and copy the data over. This would lead to a quadratic increase in time.

Try pre-allocating the columns instead:

system.time({
n <- vapply(tibbles, nrow, 0)
ntot <- sum(n)
cols <- list(a = numeric(ntot), b = character(ntot), c = numeric(ntot))

off <- 0
for (i in seq_along(tibbles)) {
    ix <- off + seq_len(n[[i]])
    for (j in seq_along(cols)) {
        cols[[j]][ix] <- tibbles[[i]][[j]]
    }
    off <- off + n[[i]]
}

result <- as_tibble(cols)
})
#>    user  system elapsed 
#>   0.073   0.012   0.085

Compare with the purrr::reduce approach:

system.time(tibbles[1:200] %>% purrr::reduce(bind_rows))
#>   user  system elapsed 
#>  4.888   2.013   6.928 

Although, as aosmith notes, in your situation it's better to just use bind_rows:

system.time(result <- bind_rows(tibbles))
#>  user  system elapsed 
#> 0.039   0.005   0.044 
like image 185
Patrick Perry Avatar answered Oct 20 '22 22:10

Patrick Perry