Combine a list of data frames into one data frame by row

I have code that at one point ends up with a list of data frames, which I want to convert into a single big data frame.

I got some pointers from an earlier question which was trying to do something similar but more complex.

Here's an example of what I am starting with (this is grossly simplified for illustration):

    listOfDataFrames <- vector(mode = "list", length = 100)
    for (i in 1:100) {
        listOfDataFrames[[i]] <- data.frame(a = sample(letters, 500, replace = TRUE),
                                            b = rnorm(500), c = rnorm(500))
    }

I am currently using this:

  df <- do.call("rbind", listOfDataFrames) 
JD Long asked May 17 '10 17:05


2 Answers

Use bind_rows() from the dplyr package:

    library(dplyr)
    bind_rows(list_of_dataframes, .id = "column_label")
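
As a minimal sketch of what the `.id` argument does, the example below binds a small named list. The list contents and the names `first` and `second` are made up for illustration and are not from the question. With a named list, the new `column_label` column records which list element each row came from.

```r
library(dplyr)

# Hypothetical example data: a small named list of data frames.
list_of_dataframes <- list(
  first  = data.frame(a = c("x", "y"), b = c(1, 2)),
  second = data.frame(a = "z", b = 3)
)

# Stack the data frames by row; .id adds a column holding the list names.
combined <- bind_rows(list_of_dataframes, .id = "column_label")
print(combined)
```

If you do not need to track the source of each row, omit `.id` and call `bind_rows(list_of_dataframes)` on its own.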
joeklieg answered Oct 12 '22 19:10


One other option is to use a plyr function:

    library(plyr)
    df <- ldply(listOfDataFrames, data.frame)

This is a little slower than the original:

    > system.time({ df <- do.call("rbind", listOfDataFrames) })
       user  system elapsed
       0.25    0.00    0.25
    > system.time({ df2 <- ldply(listOfDataFrames, data.frame) })
       user  system elapsed
       0.30    0.00    0.29
    > identical(df, df2)
    [1] TRUE

My guess is that do.call("rbind", ...) is going to be the fastest approach you will find, unless you can (a) use matrices instead of data frames and (b) preallocate the final matrix and assign into it rather than growing it.
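
A rough sketch of the preallocation idea in base R follows. It assumes every piece is a non-empty numeric matrix with the same columns, so it keeps only the numeric columns `b` and `c` and drops the character column `a` from the question's example. The names `listOfMatrices`, `rowCounts`, and `result` are made up for illustration.

```r
# Hypothetical example data: a list of numeric matrices with identical columns.
set.seed(1)
listOfMatrices <- lapply(1:100, function(i) {
  matrix(rnorm(500 * 2), ncol = 2, dimnames = list(NULL, c("b", "c")))
})

# Preallocate the full result once, sized from the total row count.
rowCounts <- vapply(listOfMatrices, nrow, integer(1))
result <- matrix(NA_real_, nrow = sum(rowCounts), ncol = 2,
                 dimnames = list(NULL, c("b", "c")))

# Copy each piece into its own block of rows instead of growing the result.
ends <- cumsum(rowCounts)
starts <- ends - rowCounts + 1L
for (i in seq_along(listOfMatrices)) {
  result[starts[i]:ends[i], ] <- listOfMatrices[[i]]
}
```

Whether this actually beats do.call("rbind", ...) depends on the data, so it is worth timing both with system.time() before committing to either.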

Edit 1:

Based on Hadley's comment, here's the latest version of rbind.fill from CRAN:

    > system.time({ df3 <- rbind.fill(listOfDataFrames) })
       user  system elapsed
       0.24    0.00    0.23
    > identical(df, df3)
    [1] TRUE

This is simpler to write than do.call("rbind", ...) and marginally faster (the timings hold up over multiple runs). As far as I understand, the version of plyr on GitHub is even faster.

Shane answered Oct 12 '22 17:10