 

cbind vs rbind with data.table

Tags: r, data.table

I've noticed that cbind takes considerably longer than rbind for data.tables. What is the reason for this?

> library(data.table)
> dt <- as.data.table(mtcars)
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- rbind(dt.new, dt)})
   user  system elapsed
  0.237   0.012   0.253
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- cbind(dt.new, dt)})
   user  system elapsed
 14.795   0.090  14.912

where timeit is defined as:

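timeit <- function(expr)
{
    ptm <- proc.time()   # start time
    expr                 # lazy evaluation: the argument is forced (run) here
    proc.time() - ptm    # user, system and elapsed time spent on expr
}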
andrew asked Jun 03 '15 14:06



1 Answer

Ultimately I think this comes down to alloc.col being slow, due to a loop in which it removes various attributes from the columns. I'm not entirely sure why that's done; perhaps Arun or Matt can explain.
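For context, alloc.col is the step that reserves spare column pointer slots on a data.table, so that columns can later be added by reference. The following is only a rough sketch of that over-allocation as seen from R, using the exported truelength(); the column name new_col is made up for illustration, and the exact number of spare slots depends on the data.table version and options.

```r
library(data.table)

dt <- as.data.table(mtcars)

used_before  <- length(dt)      # columns currently in use
slots_before <- truelength(dt)  # column slots allocated (includes spare ones)

# Adding a column by reference fills one of the spare slots, so the
# allocation itself should not need to grow here.
dt[, new_col := 0L]             # new_col is a hypothetical column name

used_after  <- length(dt)
slots_after <- truelength(dt)
```

Comparing used_before with slots_before shows how many slots were reserved up front; cbind has to set up this over-allocation again for every result it returns.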

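As you can see below, the basic operations that cbind has to perform (concatenating the column lists, setting the class, and over-allocating column slots) are much faster than rbind when called directly: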

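# Minimal cbind for data.tables. It relies on unexported data.table internals
# (Calloccolwrapper, Csetnamed), so it may break with other data.table versions.
cbind.dt.simple = function(...) {
  x = c(...)                                          # concatenate the column lists
  setattr(x, "class", c("data.table", "data.frame"))  # set the class by reference
  # over-allocate column pointer slots, as alloc.col does
  ans = .Call(data.table:::Calloccolwrapper, x, max(100L, ncol(x) + 64L), FALSE)
  .Call(data.table:::Csetnamed, ans, 0L)              # reset the NAMED count on the result
}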

library(microbenchmark)

microbenchmark(rbind(dt, dt), cbind(dt, dt), cbind.dt.simple(dt, dt))
#Unit: microseconds
#                    expr      min        lq      mean    median        uq       max neval
#           rbind(dt, dt)  785.318  996.5045 1665.1762 1234.4045 1520.3830 21327.426   100
#           cbind(dt, dt) 2350.275 3022.5685 3885.0014 3533.7595 4093.1975 21606.895   100
# cbind.dt.simple(dt, dt)   74.125  116.5290  168.5101  141.9055  180.3035  1903.526   100
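Since the per-call overhead is what hurts inside the question's loop, one way to sidestep it is to bind all the pieces in a single call instead of growing the table 100 times. This is just a sketch under that assumption, not a benchmark; the names pieces, rows.once and cols.once are made up here.

```r
library(data.table)

dt <- as.data.table(mtcars)

# 101 copies: the question's starting copy plus 100 loop iterations
pieces <- rep(list(dt), 101L)

rows.once <- rbindlist(pieces)       # stack all copies as rows in one call
cols.once <- do.call(cbind, pieces)  # bind all copies as columns in one call

dim(rows.once)  # 3232   11
dim(cols.once)  #   32 1111
```

The cbind result carries duplicated column names, just as the loop in the question does.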
eddi answered Oct 02 '22 18:10