 

Efficient way to rbind data.frames with different columns

I have a list of data frames with different sets of columns. I would like to combine them by rows into one data frame. I use plyr::rbind.fill to do that. I am looking for something that would do this more efficiently, but is similar to the answer given here

    require(plyr)

    set.seed(45)
    sample.fun <- function() {
        nam <- sample(LETTERS, sample(5:15))
        val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
        setNames(val, nam)
    }
    ll <- replicate(1e4, sample.fun())
    rbind.fill(ll)
mrkilinc asked Aug 01 '13 20:08





1 Answer

UPDATE: See this updated answer instead.

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

    DT1 = data.table(a = 1:2, b = 1:2)
    DT2 = data.table(a = 3:4, c = 1:2)

    rbind(DT1, DT2, fill = TRUE)
    #   a  b  c
    #1: 1  1 NA
    #2: 2  2 NA
    #3: 3 NA  1
    #4: 4 NA  2
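
For a whole list of data.frames, as in the question, rbindlist accepts the same kind of argument in later data.table releases. The following is a minimal sketch, assuming an installed data.table whose rbindlist has the use.names and fill arguments (check ?rbindlist for your version). The small list ll_small is made up for illustration.

```r
library(data.table)

# Hypothetical three-element list with partially overlapping columns
ll_small <- list(
  data.frame(A = c("x", "y"), B = c("p", "q"), stringsAsFactors = FALSE),
  data.frame(B = "r", C = "s", stringsAsFactors = FALSE),
  data.frame(C = "t", A = "z", stringsAsFactors = FALSE)
)

# use.names = TRUE matches columns by name; fill = TRUE pads absent ones with NA
res <- rbindlist(ll_small, use.names = TRUE, fill = TRUE)
print(res)
```

The result has one column per distinct name across the list, in order of first appearance.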

FR #4790 has now been filed: rbind.fill-like functionality (as in plyr) for merging a list of data.frames/data.tables.

Note 1:

This solution uses data.table's rbindlist function to "rbind" a list of data.tables. Be sure to use version 1.8.9, because of this bug in versions < 1.8.9.

Note 2:

When binding lists of data.frames/data.tables, rbindlist currently keeps each column's data type as it appears in the first data.frame. That is, if a column is character in the first data.frame and the same column is a factor in the second, the resulting column will be character. So, if your data.frames consist entirely of character columns, the result of this method will be identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you'll have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
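
If factor columns are needed afterwards, the conversion can be sketched as follows. This uses only base R on a plain data.frame. The object res and the choice of columns to convert are assumptions made up for illustration, not taken from the benchmark.

```r
# Hypothetical bound result in which every column came back as character
res <- data.frame(
  A = c("x", "y", NA),
  B = c("p", NA, "q"),
  stringsAsFactors = FALSE
)

# Columns that should be factors (an assumption for this example)
factor_cols <- c("A", "B")

# Convert the selected columns; NA entries stay NA and do not become a level
res[factor_cols] <- lapply(res[factor_cols], factor)

str(res)
```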

And now here's the solution using data.table (with a benchmark comparison against rbind.fill from plyr):

    require(data.table)
    rbind.fill.DT <- function(ll) {
        # changed sapply to lapply to return a list always
        all.names <- lapply(ll, names)
        unq.names <- unique(unlist(all.names))
        ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
            tt <- ll[[x]]
            setattr(tt, 'class', c('data.table', 'data.frame'))
            data.table:::settruelength(tt, 0L)
            invisible(alloc.col(tt))
            tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
            setcolorder(tt, unq.names)
        }))
    }

    rbind.fill.PLYR <- function(ll) {
        rbind.fill(ll)
    }

    require(microbenchmark)
    microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
    # Unit: seconds
    #                      expr      min        lq    median        uq       max neval
    #   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
    # t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10

    # for comparison change t2 to data.table
    setattr(t2, 'class', c('data.table', 'data.frame'))
    data.table:::settruelength(t2, 0L)
    invisible(alloc.col(t2))
    setcolorder(t2, unique(unlist(sapply(ll, names))))

    identical(t1, t2)
    # [1] TRUE
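
The idea behind rbind.fill.DT (collect the union of column names, pad each element with the columns it lacks, put the columns in a common order, then bind) can be illustrated without any packages. This is only a sketch of the technique in base R, not the code that was benchmarked. fill_bind and ll_small are hypothetical names, and do.call(rbind, ...) is likely to be slow on long lists.

```r
# Pad each data.frame with the columns it lacks, then bind the padded pieces.
# fill_bind is a hypothetical helper name, not part of plyr or data.table.
fill_bind <- function(ll) {
  all_names <- unique(unlist(lapply(ll, names)))
  padded <- lapply(ll, function(df) {
    # add every absent column, filled with NA of type character
    for (nm in setdiff(all_names, names(df))) {
      df[[nm]] <- rep(NA_character_, nrow(df))
    }
    df[all_names]  # put columns in a common order
  })
  do.call(rbind, padded)
}

# Small made-up input with partially overlapping columns
ll_small <- list(
  data.frame(A = "x", B = "p", stringsAsFactors = FALSE),
  data.frame(B = "q", C = "r", stringsAsFactors = FALSE)
)

out <- fill_bind(ll_small)
print(out)
```

Padding with NA_character_ mirrors the answer's assumption that all columns are character; data with other column types would need a different fill value per column.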

It should be noted that plyr's rbind.fill is slightly faster than this particular data.table solution for lists of up to about 500 data.frames.

Benchmarking plot:

Here's the plot of runs on lists of data.frames with lengths seq(1000, 10000, by=1000). I used microbenchmark with 10 repetitions for each list length.

[Image: benchmarking plot]

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.

Arun answered Sep 28 '22 07:09
