I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table's rbindlist is the fastest, with an average execution time of 428 milliseconds. That's more than twice as fast as bind_rows from dplyr, which took an average of 1,050 milliseconds, and more than 10 times faster than rbind from base R, which took an average of 5,358 milliseconds!
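A sketch of how such a comparison can be reproduced with the microbenchmark package (the list of data frames and its size are made up here; absolute timings will differ on your machine):

library(data.table)
library(dplyr)
library(microbenchmark)

# 100 small data frames to bind together (placeholder data)
dfs <- replicate(100, data.frame(x = rnorm(1000), y = runif(1000)),
                 simplify = FALSE)

microbenchmark(
  rbindlist = rbindlist(dfs),
  bind_rows = bind_rows(dfs),
  rbind     = do.call(rbind, dfs),
  times     = 10
)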
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows. Let's use these functions to create a matrix with the numbers 1 through 30.
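For instance:

# cbind: three length-10 vectors become the columns of a 10 x 3 matrix
m_cols <- cbind(1:10, 11:20, 21:30)

# rbind: the same vectors become the rows of a 3 x 10 matrix
m_rows <- rbind(1:10, 11:20, 21:30)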
When the data frames have different columns, rbind throws an error, whereas bind_rows assigns NA in the rows where a column is missing from one of the data frames.
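For example:

library(dplyr)

df1 <- data.frame(a = 1:2, b = 3:4)
df2 <- data.frame(a = 5:6)           # no column "b"

# rbind(df1, df2)    # error: numbers of columns of arguments do not match
bind_rows(df1, df2)  # rows coming from df2 get NA in column "b"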
The function rbind() is slow, particularly as the data frame gets bigger. You should never use it in a loop. The right way to do it is to initialize the output object at its final size right from the start and then simply fill it in with each turn of the loop.
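A sketch of that pattern (column names and fill values are placeholders):

n <- 1000
out <- data.frame(x = numeric(n), y = numeric(n))  # allocated at its final size
for (i in seq_len(n)) {
  out$x[i] <- i    # fill in place ...
  out$y[i] <- i^2  # ... instead of out <- rbind(out, new_row)
}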
data.table is your friend!
Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2):

"Same as do.call("rbind", l), but much faster."
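A minimal usage sketch for the question's situation (the toy data frames here are placeholders):

library(data.table)

df           <- data.frame(x = 1:3, y = letters[1:3])
df.extension <- data.frame(x = 4:6, y = letters[4:6])

df <- rbindlist(list(df, df.extension))  # returns a data.table
# setDF(df)  # convert back to a plain data.frame (by reference) if needed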
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One inadvisable method that saves quite a bit of memory is to treat your data frames as lists, combine their columns into a new list with a for-loop (apply will eat memory like hell), and then make R believe the result actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure to test well enough, and avoid this approach if you possibly can.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

# concatenate the columns pairwise into a plain list ...
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}

# ... then make R believe the list is a data frame
attr(dtf, "row.names") <- 1:(n1 + n2)
attr(dtf, "class") <- "data.frame"
It erases any row names you had (you can reconstruct them, but check for duplicates!). It also skips all the other checks that rbind carries out.
It saves you about half the memory in my tests, and the resulting dtf is equal to dtfcomb. In the memory-usage plot from my tests, the red box is rbind and the yellow one is my list-based approach.
Test script:

n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)

# reference run: plain rbind
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)

# list-based approach
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1 + n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()