Just had a conversation with coworkers about this, and we thought it'd be worth seeing what people out in SO land had to say. Suppose I had a list with N elements, where each element was a vector of length X. Now suppose I wanted to transform that into a data.frame. As with most things in R, there are multiple ways of skinning the proverbial cat, such as `as.data.frame`, using the plyr package, combining `do.call` with `cbind`, pre-allocating the DF and filling it in, and others.
The problem that was presented was what happens when either N or X (in our case it is X) becomes extremely large. Is there one cat skinning method that's notably superior when efficiency (particularly in terms of memory) is of the essence?
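For concreteness, here is a minimal base-R sketch of three of the approaches mentioned above, run on a toy list. The names `lst`, `df1`, `df2` and `df3` are made up for illustration, the plyr variant is left out, and this only shows what each approach looks like, not how it performs at scale.

```r
# Toy input: N = 3 list elements, each a numeric vector of length X = 4.
lst <- list(a = c(1, 2, 3, 4), b = c(5, 6, 7, 8), c = c(9, 10, 11, 12))

# 1. as.data.frame(): each list element becomes a column.
df1 <- as.data.frame(lst)

# 2. do.call() with cbind(): builds an intermediate matrix, then converts it.
df2 <- as.data.frame(do.call(cbind, lst))

# 3. Pre-allocate a data frame of the right shape, then fill it column by column.
df3 <- data.frame(matrix(NA_real_, nrow = length(lst[[1]]), ncol = length(lst)))
names(df3) <- names(lst)
for (nm in names(lst)) {
  df3[[nm]] <- lst[[nm]]
}

# Compare the results of the three approaches.
print(all.equal(df1, df2))
print(all.equal(df1, df3))
```

Note that the `cbind` route goes through a matrix, so it only makes sense when every element has the same type; mixed types would be coerced to a common one.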
Since a `data.frame` is already a list and you know that each list element is the same length (X), the fastest thing would probably be to just update the `class` and `row.names` attributes:
```r
set.seed(21)
n <- 1e6
x <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
x <- c(x,x,x,x,x,x)
system.time(a <- as.data.frame(x))
system.time(b <- do.call(data.frame,x))
system.time({
  d <- x  # Skip 'c' so Joris doesn't down-vote me! ;-)
  class(d) <- "data.frame"
  rownames(d) <- 1:n
  names(d) <- make.unique(names(d))
})
identical(a, b)  # TRUE
identical(b, d)  # TRUE
```
Update - this is ~2x faster than creating `d`:
```r
system.time({
  e <- x
  attr(e, "row.names") <- c(NA_integer_,n)
  attr(e, "class") <- "data.frame"
  attr(e, "names") <- make.names(names(e), unique=TRUE)
})
identical(d, e)  # TRUE
```
Update 2 - I forgot about memory consumption. The last update makes two copies of `e`. Using the `attributes` function reduces that to only one copy.
```r
set.seed(21)
f <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
f <- c(f,f,f,f,f,f)
tracemem(f)
system.time({  # makes 2 copies
  attr(f, "row.names") <- c(NA_integer_,n)
  attr(f, "class") <- "data.frame"
  attr(f, "names") <- make.names(names(f), unique=TRUE)
})

set.seed(21)
g <- list(x=rnorm(n), y=rnorm(n), z=rnorm(n))
g <- c(g,g,g,g,g,g)
tracemem(g)
system.time({  # only makes 1 copy
  attributes(g) <- list(row.names=c(NA_integer_,n),
    class="data.frame", names=make.names(names(g), unique=TRUE))
})
identical(f,g)  # TRUE
```
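If you want to reuse the single-assignment trick, it could be wrapped up roughly like this. This is only a sketch: `list_to_df` is a hypothetical helper name, and the equal-length check and the `V1`, `V2`, ... fallback names for unnamed lists are my own additions, since setting the attributes by hand skips the validation that `as.data.frame` would normally do.

```r
# Hypothetical helper (not a base R function): turns a list of equal-length
# vectors into a data.frame by setting its attributes in one assignment.
list_to_df <- function(lst) {
  stopifnot(is.list(lst), length(lst) > 0L)

  # Setting attributes directly bypasses the usual checks, so verify
  # up front that every element has the same length.
  n_rows <- unique(lengths(lst))
  if (length(n_rows) != 1L) {
    stop("all list elements must have the same length")
  }

  # Assumption: unnamed lists get placeholder column names V1, V2, ...
  nms <- names(lst)
  if (is.null(nms)) {
    nms <- paste0("V", seq_along(lst))
  }

  attributes(lst) <- list(
    row.names = c(NA_integer_, n_rows),  # compact row-name form, as above
    class = "data.frame",
    names = make.names(nms, unique = TRUE)
  )
  lst
}

# Small usage example with a duplicated column name.
small <- list(x = 1:3, y = c(0.5, 1.5, 2.5), x = c("a", "b", "c"))
df <- list_to_df(small)
str(df)
```

The length check costs one pass over the list, which should be negligible next to the copies being avoided, but drop it if you already trust the input.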