I need to rbind two large data frames. Right now I use
df <- rbind(df, df.extension)
but I (almost) instantly run out of memory. I guess it's because df is held in memory twice. I might see even bigger data frames in the future, so I need some kind of in-place rbind.
So my question is: Is there a way to avoid data duplication in memory when using rbind?
I found this question, which uses SQLite, but I really want to avoid using the hard drive as a cache.
data.table's rbindlist is the fastest, with an average execution time of 428 milliseconds. That's more than twice as fast as bind_rows from dplyr, which took an average of 1,050 milliseconds, and more than 10 times faster than rbind from base R, which took an average of 5,358 milliseconds!
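A sketch of how such a comparison can be reproduced with the microbenchmark package (the list of data frames and its size are made up here; absolute timings will differ on your machine):

library(data.table)
library(dplyr)
library(microbenchmark)

# 100 small data frames to bind together (placeholder data)
dfs <- replicate(100, data.frame(x = rnorm(1000), y = runif(1000)),
                 simplify = FALSE)

microbenchmark(
  rbindlist = rbindlist(dfs),
  bind_rows = bind_rows(dfs),
  rbind     = do.call(rbind, dfs),
  times     = 10
)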
cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows. Let's use these functions to create a matrix with the numbers 1 through 30.
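For instance:

# cbind: three length-10 vectors become the columns of a 10 x 3 matrix
m_cols <- cbind(1:10, 11:20, 21:30)

# rbind: the same vectors become the rows of a 3 x 10 matrix
m_rows <- rbind(1:10, 11:20, 21:30)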
When the data frames have different columns, rbind throws an error, whereas bind_rows assigns NA in the rows where a column is missing from one of the data frames.
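For example:

library(dplyr)

df1 <- data.frame(a = 1:2, b = 3:4)
df2 <- data.frame(a = 5:6)           # no column "b"

# rbind(df1, df2)    # error: numbers of columns of arguments do not match
bind_rows(df1, df2)  # rows coming from df2 get NA in column "b"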
The function rbind() is slow, particularly as the data frame gets bigger. You should never use it in a loop. The right way to do it is to initialize the output object at its final size right from the start and then simply fill it in with each turn of the loop.
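A sketch of that pattern (column names and fill values are placeholders):

n <- 1000
out <- data.frame(x = numeric(n), y = numeric(n))  # allocated at its final size
for (i in seq_len(n)) {
  out$x[i] <- i    # fill in place ...
  out$y[i] <- i^2  # ... instead of out <- rbind(out, new_row)
}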
data.table is your friend!
Cf. http://www.mail-archive.com/r-help@r-project.org/msg175877.html
Following up on nikola's comment, here is ?rbindlist's description (new in v1.8.2):

"Same as do.call("rbind", l), but much faster."
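A minimal usage sketch for the question's situation (the toy data frames here are placeholders):

library(data.table)

df           <- data.frame(x = 1:3, y = letters[1:3])
df.extension <- data.frame(x = 4:6, y = letters[4:6])

df <- rbindlist(list(df, df.extension))  # returns a data.table
# setDF(df)  # convert back to a plain data.frame (by reference) if needed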
First of all: use the solution from the other question you link to if you want to be safe. As R is call-by-value, forget about an "in-place" method that doesn't copy your data frames in memory.
One inadvisable method that saves quite a bit of memory is to treat your data frames as lists, combine their columns into a new list with a for-loop (apply will eat memory like hell), and then make R believe the result actually is a data frame.
I'll warn you again: using this on more complex data frames is asking for trouble and hard-to-find bugs. So be sure to test well enough, and avoid this approach if you possibly can.
You could try the following approach:
n1 <- 1000000
n2 <- 1000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))

# concatenate the columns pairwise into a plain list ...
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}

# ... then make R believe the list is a data frame
attr(dtf, "row.names") <- 1:(n1 + n2)
attr(dtf, "class") <- "data.frame"
It erases any row names you had (you can reconstruct them, but check for duplicates!). It also skips all the other checks that rbind carries out.
It saves you about half the memory in my tests, and the resulting dtf is equal to dtfcomb. In the memory-usage plot from my tests, the red box is rbind and the yellow one is my list-based approach.
Test script:

n1 <- 3000000
n2 <- 3000000
ncols <- 20
dtf1 <- as.data.frame(matrix(sample(n1*ncols), n1, ncols))
dtf2 <- as.data.frame(matrix(sample(n2*ncols), n2, ncols))
gc()
Sys.sleep(10)

# reference run: plain rbind
dtfcomb <- rbind(dtf1, dtf2)
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtfcomb)
gc()
Sys.sleep(10)

# list-based approach
dtf <- list()
for (i in names(dtf1)) {
  dtf[[i]] <- c(dtf1[[i]], dtf2[[i]])
}
attr(dtf, "row.names") <- 1:(n1 + n2)
attr(dtf, "class") <- "data.frame"
Sys.sleep(10)
gc()
Sys.sleep(10)
rm(dtf)
gc()