Most efficient list to data.frame method?

Tags: performance, memory-management, dataframe, r, data.table

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, comboing do.call with cbind, pre-allocating the DF and filling it in, and others.

The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat-skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?


asked May 09 '11 21:05

geoffjentry

1 Answer

Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:

    set.seed(21)
    n <- 1e6
    x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    x <- c(x,x,x,x,x,x)

    system.time(a <- as.data.frame(x))
    system.time(b <- do.call(data.frame,x))
    system.time({
      d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
      class(d) <- "data.frame"
      rownames(d) <- 1:n
      names(d) <- make.unique(names(d))
    })

    identical(a, b)  # TRUE
    identical(b, d)  # TRUE

Update - this is ~2x faster than creating d:

    system.time({
      e <- x
      attr(e, "row.names") <- c(NA_integer_,n)
      attr(e, "class") <- "data.frame"
      attr(e, "names") <- make.names(names(e), unique=TRUE)
    })

    identical(d, e)  # TRUE

Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

    set.seed(21)
    f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    f <- c(f,f,f,f,f,f)
    tracemem(f)
    system.time({  # makes 2 copies
      attr(f, "row.names") <- c(NA_integer_,n)
      attr(f, "class") <- "data.frame"
      attr(f, "names") <- make.names(names(f), unique=TRUE)
    })

    set.seed(21)
    g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    g <- c(g,g,g,g,g,g)
    tracemem(g)
    system.time({  # only makes 1 copy
      attributes(g) <- list(row.names=c(NA_integer_,n),
        class="data.frame", names=make.names(names(g), unique=TRUE))
    })

    identical(f,g)  # TRUE
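For reuse, the single-call `attributes<-` approach could be wrapped in a small helper. This is only a sketch: the name `list_to_df`, the equal-length check, and the fallback `V1, V2, ...` column names are assumptions added for illustration and are not part of the answer above. It also assumes R 3.2.0 or later for `lengths()`.

```r
# Hypothetical helper (not from the original answer): turn a list of
# equal-length vectors into a data.frame by setting all attributes at once.
list_to_df <- function(lst) {
  stopifnot(is.list(lst), length(lst) > 0L)

  # The attribute shortcut is only valid when every column has the same length.
  lens <- lengths(lst)
  if (any(lens != lens[1L])) {
    stop("all list elements must have the same length")
  }

  # Assumed fallback: invent names when the list has none.
  nms <- names(lst)
  if (is.null(nms)) nms <- paste0("V", seq_along(lst))

  # Same attribute set as in the answer: compact row names, class, unique names.
  attributes(lst) <- list(
    row.names = c(NA_integer_, lens[[1L]]),
    class     = "data.frame",
    names     = make.names(nms, unique = TRUE)
  )
  lst
}

df <- list_to_df(list(a = 1:3, b = c("x", "y", "z"), a = c(1.5, 2.5, 3.5)))
str(df)
```

The length check matters because nothing in the attribute assignment itself validates the columns; a ragged list would otherwise produce a malformed data.frame without any error.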

answered Sep 23 '22 10:09

Joshua Ulrich
