Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Growing a data.frame in a memory-efficient manner

According to Creating an R dataframe row-by-row, it's not ideal to append to a data.frame using rbind, as it creates a copy of the whole data.frame each time. How do I accumulate data in R resulting in a data.frame without incurring this penalty? The intermediate format doesn't need to be a data.frame.

like image 771
Reactormonk Avatar asked Jul 14 '12 18:07

Reactormonk


People also ask

What is a data frame how is it formed?

Data frames are the de facto data structure for most tabular data, and what we use for statistics and plotting. A data frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors).

What is the purpose of data frames?

A DataFrame is a data structure that organizes data into a 2-dimensional table of rows and columns, much like a spreadsheet. DataFrames are one of the most common data structures used in modern data analytics because they are a flexible and intuitive way of storing and working with data.

What data frame means?

A data frame is a table or a two-dimensional array-like structure in which each column contains values of one variable and each row contains one set of values from each column. Following are the characteristics of a data frame. The column names should be non-empty.


1 Answers

First approach

I tried accessing each element of a pre-allocated data.frame:

res <- data.frame(x=rep(NA,1000), y=rep(NA,1000)) tracemem(res) for(i in 1:1000) {   res[i,"x"] <- runif(1)   res[i,"y"] <- rnorm(1) } 

But tracemem goes crazy (e.g. the data.frame is being copied to a new address each time).

Alternative approach (doesn't work either)

One approach (not sure it's faster as I haven't benchmarked yet) is to create a list of data.frames, then stack them all together:

makeRow <- function() data.frame(x=runif(1),y=rnorm(1)) res <- replicate(1000, makeRow(), simplify=FALSE ) # returns a list of data.frames library(taRifx) res.df <- stack(res) 

Unfortunately in creating the list I think you will be hard-pressed to pre-allocate. For instance:

> tracemem(res) [1] "<0x79b98b0>" > res[[2]] <- data.frame() tracemem[0x79b98b0 -> 0x71da500]:  

In other words, replacing an element of the list causes the list to be copied. I assume the whole list, but it's possible it's only that element of the list. I'm not intimately familiar with the details of R's memory management.

Probably the best approach

As with many speed or memory-limited processes these days, the best approach may well be to use data.table instead of a data.frame. Since data.table has the := assign by reference operator, it can update without re-copying:

library(data.table) dt <- data.table(x=rep(0,1000), y=rep(0,1000)) tracemem(dt) for(i in 1:1000) {   dt[i,x := runif(1)]   dt[i,y := rnorm(1)] } # note no message from tracemem 

But as @MatthewDowle points out, set() is the appropriate way to do this inside a loop. Doing so makes it faster still:

library(data.table) n <- 10^6 dt <- data.table(x=rep(0,n), y=rep(0,n))  dt.colon <- function(dt) {   for(i in 1:n) {     dt[i,x := runif(1)]     dt[i,y := rnorm(1)]   } }  dt.set <- function(dt) {   for(i in 1:n) {     set(dt,i,1L, runif(1) )     set(dt,i,2L, rnorm(1) )   } }  library(microbenchmark) m <- microbenchmark(dt.colon(dt), dt.set(dt),times=2) 

(Results shown below)

Benchmarking

With the loop run 10,000 times, data table is almost a full order of magnitude faster:

Unit: seconds           expr        min         lq     median         uq        max 1    test.df()  523.49057  523.49057  524.52408  525.55759  525.55759 2    test.dt()   62.06398   62.06398   62.98622   63.90845   63.90845 3 test.stack() 1196.30135 1196.30135 1258.79879 1321.29622 1321.29622 

benchmarks

And comparison of := with set():

> m Unit: milliseconds           expr       min        lq    median       uq      max 1 dt.colon(dt) 654.54996 654.54996 656.43429 658.3186 658.3186 2   dt.set(dt)  13.29612  13.29612  15.02891  16.7617  16.7617 

Note that n here is 10^6 not 10^5 as in the benchmarks plotted above. So there's an order of magnitude more work, and the result is measured in milliseconds not seconds. Impressive indeed.

like image 193
Ari B. Friedman Avatar answered Sep 28 '22 11:09

Ari B. Friedman