I've noticed that cbind takes considerably longer than rbind for data.tables. What is the reason for this?
> dt <- as.data.table(mtcars)
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- rbind(dt.new, dt)})
   user  system elapsed
  0.237   0.012   0.253
> dt.new <- copy(dt)
> timeit({for (i in 1:100) dt.new <- cbind(dt.new, dt)})
   user  system elapsed
 14.795   0.090  14.912
where
timeit <- function(expr)
{
    ptm <- proc.time()
    expr
    proc.time() - ptm
}
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows.
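As a quick illustration of that base behaviour (a minimal sketch; the vector names x and y are just placeholders):

```r
x <- 1:3
y <- 4:6

cbind(x, y)  # 3 x 2 matrix: each vector becomes a column
rbind(x, y)  # 2 x 3 matrix: each vector becomes a row
```

For data.tables, however, both functions dispatch to data.table's own methods, which is where the timing difference in the question arises.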
As many before me have documented, I also find that rbindlist() is the fastest method and rbind() the slowest.
rbind() is slow, and gets slower as the data frame grows, so you should never call it inside a loop. The right approach is to initialize the output object at its final size up front and fill it in on each iteration of the loop.
Ultimately I think this comes down to alloc.col being slow due to a loop where it removes various attributes from the columns. I'm not entirely sure why that's done; perhaps Arun or Matt can explain. As you can see below, the basic operations for cbind are much faster than rbind:
cbind.dt.simple = function(...) {
    # concatenate the columns of all inputs into a single list
    x = c(...)
    setattr(x, "class", c("data.table", "data.frame"))
    # over-allocate column slots (as alloc.col would), skipping its attribute-stripping loop
    ans = .Call(data.table:::Calloccolwrapper, x, max(100L, ncol(x) + 64L), FALSE)
    # clear the named status so later modifications don't trigger copies
    .Call(data.table:::Csetnamed, ans, 0L)
}
library(microbenchmark)
microbenchmark(rbind(dt, dt), cbind(dt, dt), cbind.dt.simple(dt, dt))
#Unit: microseconds
# expr min lq mean median uq max neval
# rbind(dt, dt) 785.318 996.5045 1665.1762 1234.4045 1520.3830 21327.426 100
# cbind(dt, dt) 2350.275 3022.5685 3885.0014 3533.7595 4093.1975 21606.895 100
# cbind.dt.simple(dt, dt) 74.125 116.5290 168.5101 141.9055 180.3035 1903.526 100