I have code that at one place ends up with a list of data frames which I really want to convert to a single big data frame.
I got some pointers from an earlier question which was trying to do something similar but more complex.
Here's an example of what I am starting with (this is grossly simplified for illustration):
listOfDataFrames <- vector(mode = "list", length = 100) for (i in 1:100) { listOfDataFrames[[i]] <- data.frame(a=sample(letters, 500, rep=T), b=rnorm(500), c=rnorm(500)) }
I am currently using this:
df <- do.call("rbind", listOfDataFrames)
To combine data frames stored in a list in R, we can use full_join function of dplyr package inside Reduce function.
The binding or combining of the rows is very easy with the rbind() function in R. rbind() stands for row binding. In simpler terms joining of multiple rows to form a single batch. It may include joining two data frames, vectors, and more.
We can merge two data frames in R by using the merge() function or by using family of join() function in dplyr package. The data frames must have same column names on which the merging happens. Merge() Function in R is similar to database join operation in SQL.
You can use cbind() function in R to bind multiple columns with ease. It is also a function having a simple syntax as well. You can bind data frames, vectors and multiple columns using this function.
Use bind_rows()
from the dplyr package:
bind_rows(list_of_dataframes, .id = "column_label")
One other option is to use a plyr function:
df <- ldply(listOfDataFrames, data.frame)
This is a little slower than the original:
> system.time({ df <- do.call("rbind", listOfDataFrames) }) user system elapsed 0.25 0.00 0.25 > system.time({ df2 <- ldply(listOfDataFrames, data.frame) }) user system elapsed 0.30 0.00 0.29 > identical(df, df2) [1] TRUE
My guess is that using do.call("rbind", ...)
is going to be the fastest approach that you will find unless you can do something like (a) use a matrices instead of a data.frames and (b) preallocate the final matrix and assign to it rather than growing it.
Edit 1:
Based on Hadley's comment, here's the latest version of rbind.fill
from CRAN:
> system.time({ df3 <- rbind.fill(listOfDataFrames) }) user system elapsed 0.24 0.00 0.23 > identical(df, df3) [1] TRUE
This is easier than rbind, and marginally faster (these timings hold up over multiple runs). And as far as I understand it, the version of plyr
on github is even faster than this.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With