Growing a data.frame in a memory-efficient manner

According to "Creating an R dataframe row-by-row", it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R so that the result is a data.frame, without incurring this penalty? The intermediate format doesn't need to be a data.frame.

asked Jul 14 '12 18:07

Reactormonk

1 Answer

First approach

I tried accessing each element of a pre-allocated data.frame:

    res <- data.frame(x = rep(NA, 1000), y = rep(NA, 1000))
    tracemem(res)
    for (i in 1:1000) {
      res[i, "x"] <- runif(1)
      res[i, "y"] <- rnorm(1)
    }

But tracemem goes crazy (i.e., the data.frame is copied to a new address on each assignment).
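Since the question allows an intermediate format other than a data.frame, one common base-R pattern is to fill pre-allocated atomic vectors and build the data.frame once at the end. This is a minimal sketch rather than something benchmarked in this answer; it assumes the number of rows is known in advance, and `n`, `x`, and `y` are illustrative names.

```r
# Sketch: accumulate into plain vectors, convert to a data.frame once.
# Assumes the final number of rows (n) is known up front.
n <- 1000
x <- numeric(n)  # pre-allocated double vector
y <- numeric(n)

for (i in seq_len(n)) {
  x[i] <- runif(1)
  y[i] <- rnorm(1)
}

# Only one data.frame is ever constructed.
res <- data.frame(x = x, y = y)
```

Calling `tracemem(x)` before the loop is a quick way to check whether the vector is being copied on your R version.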

Alternative approach (doesn't work either)

One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:

    makeRow <- function() data.frame(x = runif(1), y = rnorm(1))
    res <- replicate(1000, makeRow(), simplify = FALSE)  # returns a list of data.frames
    library(taRifx)
    res.df <- stack(res)

Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:

    > tracemem(res)
    [1] "<0x79b98b0>"
    > res[[2]] <- data.frame()
    tracemem[0x79b98b0 -> 0x71da500]:

In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.
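For reference, a list can at least be given its final length up front with `vector("list", n)`, so it never has to grow. The sketch below fills such a list and combines the rows in a single base-R `rbind` call instead of using taRifx. It is an illustration under the assumption that `n` is known, not a claim that every copy is avoided; `rows` is a hypothetical name.

```r
# Sketch: pre-sized list of one-row data.frames, combined once at the end.
n <- 1000
rows <- vector("list", n)  # list of n NULL slots

for (i in seq_len(n)) {
  rows[[i]] <- data.frame(x = runif(1), y = rnorm(1))
}

# A single rbind over all rows, rather than one rbind per iteration.
res.df <- do.call(rbind, rows)
```

The key difference from repeated `rbind` inside the loop is that the growing data.frame is never rebuilt on each iteration.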

Probably the best approach

As with many speed- or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign-by-reference operator, it can update without re-copying:

    library(data.table)
    dt <- data.table(x = rep(0, 1000), y = rep(0, 1000))
    tracemem(dt)
    for (i in 1:1000) {
      dt[i, x := runif(1)]
      dt[i, y := rnorm(1)]
    }
    # note no message from tracemem
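One consequence of assigning by reference is worth seeing in a small example: a second name bound to the same data.table sees the update, and an independent snapshot needs an explicit `copy()`. This is a minimal sketch; `alias` and `snapshot` are illustrative names.

```r
library(data.table)

dt <- data.table(x = c(0, 0, 0))
alias <- dt            # no copy: both names refer to the same table

dt[2L, x := 1]         # modifies the table in place
alias$x                # the change is visible through alias as well

snapshot <- copy(dt)   # explicit deep copy
dt[3L, x := 2]         # does not affect snapshot
snapshot$x
```

So when the old state has to be kept, take the `copy()` before updating with `:=`.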

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

    library(data.table)
    n <- 10^6
    dt <- data.table(x = rep(0, n), y = rep(0, n))

    dt.colon <- function(dt) {
      for (i in 1:n) {
        dt[i, x := runif(1)]
        dt[i, y := rnorm(1)]
      }
    }

    dt.set <- function(dt) {
      for (i in 1:n) {
        set(dt, i, 1L, runif(1))
        set(dt, i, 2L, rnorm(1))
      }
    }

    library(microbenchmark)
    m <- microbenchmark(dt.colon(dt), dt.set(dt), times = 2)

(Results shown below)
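For readability, set() can also be called with named arguments and with column names in place of column positions. The following is a small sketch of the same loop at a reduced size; the row count of 1000 is an arbitrary choice for illustration.

```r
library(data.table)

n <- 1000L
dt <- data.table(x = numeric(n), y = numeric(n))

for (i in seq_len(n)) {
  # i = row index, j = column (name or position), value = new value
  set(dt, i = i, j = "x", value = runif(1))
  set(dt, i = i, j = "y", value = rnorm(1))
}
```

Column names make the loop robust to column reordering, at the cost of a name lookup on each call.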

Benchmarking

With the loop run 10,000 times, data.table is almost a full order of magnitude faster:

    Unit: seconds
              expr        min         lq     median         uq        max
    1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759
    2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845
    3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622

[Figure: benchmarks]

And comparison of := with set():

    > m
    Unit: milliseconds
              expr       min        lq    median       uq      max
    1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186
    2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617

Note that n here is 10^6 not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds not seconds. Impressive indeed.

answered Sep 28 '22 11:09

Ari B. Friedman

Tags: memory, dataframe, r