
Most efficient list to data.frame method?

Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as as.data.frame, using the plyr package, combining do.call with cbind, pre-allocating the DF and filling it in, and others.

The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
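For concreteness, here is a rough sketch of a few of the approaches mentioned above, run on a toy list. The list `lst` and the names `df1` to `df4` are made up for illustration, and the plyr variant is left out to keep the example in base R.

```r
# Toy input (hypothetical data): N = 3 elements, each a vector of length X = 4
lst <- list(a = 1:4, b = c(1.5, 2.5, 3.5, 4.5), c = c(10, 20, 30, 40))

# 1. as.data.frame on the list
df1 <- as.data.frame(lst)

# 2. do.call with data.frame
df2 <- do.call(data.frame, lst)

# 3. do.call with cbind: this builds a matrix first, so all columns
#    are coerced to a common type (here, integer 'a' becomes double)
df3 <- as.data.frame(do.call(cbind, lst))

# 4. pre-allocate a data.frame and fill it in column by column
df4 <- data.frame(matrix(NA_real_, nrow = 4, ncol = length(lst)))
names(df4) <- names(lst)
for (nm in names(lst)) df4[[nm]] <- lst[[nm]]
```

Each variant allocates intermediate objects in a different way, which is where the memory question comes from.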

geoffjentry asked May 09 '11 21:05





1 Answer

Since a data.frame is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the class and row.names attributes:

    set.seed(21)
    n <- 1e6
    x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    x <- c(x,x,x,x,x,x)

    system.time(a <- as.data.frame(x))
    system.time(b <- do.call(data.frame,x))
    system.time({
      d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
      class(d) <- "data.frame"
      rownames(d) <- 1:n
      names(d) <- make.unique(names(d))
    })

    identical(a, b)  # TRUE
    identical(b, d)  # TRUE
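To see why changing attributes is enough, it may help to inspect a small data frame directly. This is only an illustrative sketch; the object `toy` is made up and not part of the timings above.

```r
# Hypothetical toy data frame, used only to inspect its structure
toy <- data.frame(a = 1:3, b = c(0.5, 1.5, 2.5))

is.list(toy)             # a data.frame is a list of columns
typeof(toy)              # the underlying storage type
names(attributes(toy))   # the attributes that set it apart from a plain list
str(unclass(toy))        # dropping the class leaves a list plus row.names
```

The columns themselves are untouched by the class; only the attributes differ from a plain named list.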

Update - this is ~2x faster than creating d:

    system.time({
      e <- x
      attr(e, "row.names") <- c(NA_integer_,n)
      attr(e, "class") <- "data.frame"
      attr(e, "names") <- make.names(names(e), unique=TRUE)
    })

    identical(d, e)  # TRUE
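The `c(NA_integer_, n)` value relies on R's compact internal form for row names, where a length-2 integer vector starting with `NA` stands in for `1:n`. A small sketch of how one might check this, using the base helper `.row_names_info()`; the names `k` and `toy_e` are made up for illustration.

```r
k <- 5L
toy_e <- list(x = rnorm(k), y = rnorm(k))

# Compact form: two integers instead of a length-k vector of row names
attr(toy_e, "row.names") <- c(NA_integer_, k)
attr(toy_e, "class") <- "data.frame"

.row_names_info(toy_e, 0L)   # internal (compact) representation
.row_names_info(toy_e, 2L)   # number of rows implied by the attribute
attr(toy_e, "row.names")     # expanded to 1:k when accessed from R code
```

Because the compact form never materialises a length-n vector, it avoids one allocation that `rownames(d) <- 1:n` has to go through.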

Update 2 - I forgot about memory consumption. The last update makes two copies of e. Using the attributes function reduces that to only one copy.

    set.seed(21)
    f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    f <- c(f,f,f,f,f,f)
    tracemem(f)
    system.time({  # makes 2 copies
      attr(f, "row.names") <- c(NA_integer_,n)
      attr(f, "class") <- "data.frame"
      attr(f, "names") <- make.names(names(f), unique=TRUE)
    })

    set.seed(21)
    g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
    g <- c(g,g,g,g,g,g)
    tracemem(g)
    system.time({  # only makes 1 copy
      attributes(g) <- list(row.names=c(NA_integer_,n),
        class="data.frame", names=make.names(names(g), unique=TRUE))
    })

    identical(f,g)  # TRUE
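If this gets used in more than one place, the single `attributes<-` call could be wrapped in a small helper that first checks the equal-length assumption. The function name `list_to_df` and the fallback column names are hypothetical, and the sketch assumes every element is a plain atomic vector; it is not a replacement for the validation that `as.data.frame` performs.

```r
# Hypothetical helper (not part of base R): promote a list of equal-length
# vectors to a data.frame by setting all attributes in one call.
list_to_df <- function(lst) {
  stopifnot(is.list(lst), length(lst) > 0L)
  lens <- lengths(lst)
  if (any(lens != lens[1L])) {
    stop("all list elements must have the same length")
  }
  nms <- names(lst)
  if (is.null(nms)) nms <- paste0("V", seq_along(lst))  # assumed fallback names
  attributes(lst) <- list(
    names     = make.names(nms, unique = TRUE),
    row.names = .set_row_names(lens[[1L]]),  # compact automatic row names
    class     = "data.frame"
  )
  lst
}

# Usage on a toy list with a duplicated name
small <- list(x = 1:3, y = c(2.5, 3.5, 4.5), x = letters[1:3])
small_df <- list_to_df(small)
str(small_df)
```

The length check is the one piece of safety the raw attribute trick skips, so it seems worth keeping even in a speed-oriented wrapper.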
Joshua Ulrich answered Sep 23 '22 10:09
