I need to download a table from an MS SQL Server.
The table has more than 6 million rows, and the server cannot return all of the data at once.
So I wrote code that downloads 10,000 rows at a time and binds the rows together in a loop.
Assume that the getData() function returns a data frame containing 10,000 rows at a time. (Pseudocode)
    for(i in 1:600) {
      tempValue <- getData()
      wannagetValue <- rbind(wannagetValue, tempValue)
      print(i)
    }
The problem is that each iteration gets slower as the loop goes on.
I suspect that using rbind this way is not a good idea.
Any advice would be very helpful. Thank you in advance.
As many before me have documented, I also find that rbindlist() is the fastest method and rbind() is the slowest.
When the data frames do not have the same columns, rbind() throws an error, whereas bind_rows() fills the columns that are missing from one of the data frames with NA in the rows coming from that data frame.
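The contrast can be reproduced with two small data frames that do not share the same columns. This is only a minimal sketch: the data frames `a` and `b` are made up for illustration, and the bind_rows() call assumes the dplyr package, so it is only run if dplyr is installed.

```r
# Two illustrative data frames with different columns (hypothetical data)
a <- data.frame(x = 1:2, y = c("p", "q"))
b <- data.frame(x = 3:4)

# Base rbind() refuses to combine them; capture the error message instead of stopping
res <- tryCatch(rbind(a, b), error = function(e) conditionMessage(e))
print(res)

# dplyr::bind_rows() keeps all columns and fills the missing values with NA
if (requireNamespace("dplyr", quietly = TRUE)) {
  combined <- dplyr::bind_rows(a, b)
  print(combined)
}
```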
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows.
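As a small illustration of the vector case, the following sketch binds two made-up vectors (`v1` and `v2` are illustrative names) both ways and checks the shape of the resulting matrices.

```r
# Two illustrative vectors of the same length
v1 <- c(1, 2, 3)
v2 <- c(4, 5, 6)

by_col <- cbind(v1, v2)  # each vector becomes a column
by_row <- rbind(v1, v2)  # each vector becomes a row

print(dim(by_col))
print(dim(by_row))
```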
rbind() stands for "row bind": it combines vectors, matrices, and data frames by stacking them as rows. It is often used to combine multiple data frames during data manipulation.
Here are a few options that I'm sure could be better:
    library(data.table)
    library(microbenchmark)

    #function to generate your data
    getData <- function(){
      data.frame(x=rnorm(10000),y=rnorm(10000),z=rnorm(10000))
    }

    #using data table's rbindlist each iteration
    fDT1 <- function(n){
      dat <- getData()
      for(i in 1:n){
        dat <- rbindlist(list(dat,getData()))
      }
      return(data.frame(dat))
    }

    #using data table's rbindlist all at once
    fDT2 <- function(n){
      return(data.frame(rbindlist(lapply(1:n,function(x) getData()))))
    }

    #pre-allocating a data frame
    fPre <- function(n){
      dat <- data.frame(x=rep(0,n*10000),y=rep(0,n*10000),z=rep(0,n*10000))
      j <- 1
      for(i in 1:n){
        dat[j:(j+10000-1),] <- getData()
        j <- j + 10000
      }
      return(dat)
    }

    #standard do.call rbind
    f2 <- function(n){
      return(do.call(rbind,lapply(1:n,function(x) getData())))
    }

    #current approach
    f <- function(n){
      dat <- getData()
      for(i in 1:n){
        dat <- rbind(dat,getData())
      }
      return(dat)
    }
As you can see, using data.table's rbindlist() is a big improvement over base R's rbind(), and there is a big benefit in appending the rows all at once instead of in iterations; however, that may not be possible if there are memory concerns. You may also note that the speed improvements are nowhere near linear as the size of the data increases.
    > microbenchmark(fDT2(5),fDT1(5),fPre(5),f2(5),f(5),
    +                fDT2(25),fDT1(25),fPre(25),f2(25),f(25),
    +                fDT2(75),fDT1(75),fPre(75),f2(75),f(75),
    +                times=10)
    Unit: milliseconds
         expr        min         lq     median         uq         max neval
      fDT2(5)   18.31207   18.63969   24.09943   25.45590    72.01725    10
      fDT1(5)   27.65459   29.25147   36.34158   77.79446    88.82556    10
      fPre(5)   34.96257   39.39723   41.24445   43.30319    68.75897    10
        f2(5)   30.85883   33.00292   36.29100   43.53619    93.15869    10
         f(5)   87.40869   97.97500  134.50600  138.65354   147.67676    10
     fDT2(25)   89.42274   99.39819  103.90944  146.44160   156.01653    10
     fDT1(25)  224.65745  229.78129  261.52388  280.85499   300.93488    10
     fPre(25)  371.12569  412.79876  431.80571  485.37727  1046.96923    10
       f2(25)  221.03669  252.08998  265.17357  271.82414   281.47096    10
        f(25) 1446.32145 1481.01998 1491.59203 1634.99936  1849.00590    10
     fDT2(75)  326.66743  334.15669  367.83848  467.85480   520.27142    10
     fDT1(75) 1749.83842 1882.27091 2066.95241 2278.55589  2419.07205    10
     fPre(75) 3701.16220 3968.64643 4162.70585 4234.39716  4356.09462    10
       f2(75) 1174.47546 1183.98860 1314.64585 1421.09483  1537.42903    10
        f(75) 9139.36935 9349.24412 9510.90888 9977.24621 10861.51206    10
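If the chunks have to be fetched in an explicit loop, as in the question, one possible way to still bind everything at once is to store each chunk in a pre-allocated list and call rbind a single time at the end. The sketch below uses base R only; the getData() here is a stand-in that generates random data (with a hypothetical `n_rows` argument and smaller chunks than in the question), and `n_chunks` is an illustrative value. Note that all chunks are still held in memory until the final bind, so this does not address memory concerns.

```r
# Stand-in for the real download function (assumption: one chunk per call)
getData <- function(n_rows = 1000) {
  data.frame(x = rnorm(n_rows), y = rnorm(n_rows), z = rnorm(n_rows))
}

n_chunks <- 20                        # illustrative number of chunks
chunks <- vector("list", n_chunks)    # pre-allocate the list of chunks

for (i in seq_len(n_chunks)) {
  chunks[[i]] <- getData()            # store the chunk; no rbind inside the loop
}

# Bind all chunks in a single call after the loop
result <- do.call(rbind, chunks)
print(dim(result))
```

With data.table loaded, the last step could presumably be replaced by rbindlist(chunks), which was the faster variant in the timings above.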