

Efficient way to rbind data.frames with different columns

Tags: r, data.table, rbind

I have a list of data frames with different sets of columns, and I would like to combine them by rows into one data frame. I currently use plyr::rbind.fill to do that. I am looking for something that does this more efficiently, along the lines of the answer given here.

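    require(plyr)

    set.seed(45)
    sample.fun <- function() {
        # pick between 5 and 15 distinct column names
        nam <- sample(LETTERS, sample(5:15, 1))
        val <- data.frame(matrix(sample(letters, length(nam)*10, replace=TRUE), nrow=10))
        setNames(val, nam)
    }
    # simplify=FALSE guarantees that a list of data frames is returned
    ll <- replicate(1e4, sample.fun(), simplify=FALSE)
    rbind.fill(ll)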


asked Aug 01 '13 20:08

mrkilinc

1 Answer

UPDATE: See this updated answer instead.

UPDATE (eddi): This has now been implemented in version 1.8.11 as a fill argument to rbind. For example:

    DT1 = data.table(a = 1:2, b = 1:2)
    DT2 = data.table(a = 3:4, c = 1:2)

    rbind(DT1, DT2, fill = TRUE)
    #   a  b  c
    #1: 1  1 NA
    #2: 2  2 NA
    #3: 3 NA  1
    #4: 4 NA  2
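Since the question starts from a list of data frames rather than two tables, the same fill behaviour can probably be reached in one call through rbindlist. This is only a minimal sketch: it assumes a data.table version recent enough that rbindlist accepts a fill argument, and the toy inputs df1 and df2 are made up for illustration.

```r
library(data.table)

# Hypothetical toy inputs with partly different columns
df1 <- data.frame(a = 1:2, b = c("x", "y"), stringsAsFactors = FALSE)
df2 <- data.frame(a = 3:4, c = c("p", "q"), stringsAsFactors = FALSE)

# fill = TRUE pads columns that are missing from an input with NA;
# columns are matched by name rather than by position
res <- rbindlist(list(df1, df2), fill = TRUE)
res
```

With the list from the question, the call would simply be rbindlist(ll, fill = TRUE).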

FR #4790 has now been added: rbind.fill-like functionality (from plyr) to merge a list of data.frames/data.tables

Note 1:

This solution uses data.table's rbindlist function to "rbind" a list of data.tables. Be sure to use version 1.8.9, because of this bug in versions < 1.8.9.

Note 2:

When binding lists of data.frames/data.tables, rbindlist currently gives each column the data type it has in the first data.frame. That is, if a column is character in the first data.frame and the same column is a factor in the second, the resulting column will be character. So if your data.frames consist entirely of character columns, the result of this method will be identical to that of the plyr method. If not, the values will still be the same, but some columns will be character instead of factor, and you will have to convert them to factor yourself afterwards. Hopefully this behaviour will change in the future.
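The manual conversion back to factor mentioned above could look roughly like this. It is a sketch under assumptions: DT stands in for a bound result whose columns all came out as character, and to.factor is a hypothetical vector naming the columns you want as factors.

```r
library(data.table)

# Hypothetical bound result in which every column is character
DT <- data.table(A = c("a", "b", "a"), B = c("x", NA, "y"))

# Columns that should become factors (an assumption; choose your own)
to.factor <- c("A", "B")

# Convert the selected columns by reference; NA entries stay NA
DT[, (to.factor) := lapply(.SD, as.factor), .SDcols = to.factor]

sapply(DT, class)
```

Converting by reference with := avoids copying the whole table, which matters when the result is large.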

Here is the data.table solution, together with a benchmark comparison against rbind.fill from plyr:

    require(data.table)
    rbind.fill.DT <- function(ll) {
        # changed sapply to lapply to return a list always
        all.names <- lapply(ll, names)
        unq.names <- unique(unlist(all.names))
        ll.m <- rbindlist(lapply(seq_along(ll), function(x) {
            tt <- ll[[x]]
            setattr(tt, 'class', c('data.table', 'data.frame'))
            data.table:::settruelength(tt, 0L)
            invisible(alloc.col(tt))
            tt[, c(unq.names[!unq.names %chin% all.names[[x]]]) := NA_character_]
            setcolorder(tt, unq.names)
        }))
    }

    rbind.fill.PLYR <- function(ll) {
        rbind.fill(ll)
    }

    require(microbenchmark)
    microbenchmark(t1 <- rbind.fill.DT(ll), t2 <- rbind.fill.PLYR(ll), times=10)
    # Unit: seconds
    #                      expr      min        lq    median        uq       max neval
    #   t1 <- rbind.fill.DT(ll)  10.8943  11.02312  11.26374  11.34757  11.51488    10
    # t2 <- rbind.fill.PLYR(ll) 121.9868 134.52107 136.41375 184.18071 347.74724    10

    # for comparison change t2 to data.table
    setattr(t2, 'class', c('data.table', 'data.frame'))
    data.table:::settruelength(t2, 0L)
    invisible(alloc.col(t2))
    setcolorder(t2, unique(unlist(sapply(ll, names))))

    identical(t1, t2)
    # [1] TRUE
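The core idea inside rbind.fill.DT (collect the union of column names, pad each data frame with NA columns, reorder, then bind) can be shown on a small scale in base R. This sketch is for illustrating the logic only, not for speed; fill_bind and the inputs x and y are hypothetical names introduced here.

```r
# Base R only; fill_bind is a hypothetical helper name
fill_bind <- function(ll) {
  # union of all column names, in order of first appearance
  all.names <- unique(unlist(lapply(ll, names)))
  padded <- lapply(ll, function(df) {
    # add every column this data frame lacks, filled with NA
    for (nm in setdiff(all.names, names(df))) df[[nm]] <- NA_character_
    # put the columns in a common order so rbind lines them up
    df[, all.names, drop = FALSE]
  })
  do.call(rbind, padded)
}

x <- data.frame(A = c("a", "b"), B = c("c", "d"), stringsAsFactors = FALSE)
y <- data.frame(B = c("e", "f"), C = c("g", "h"), stringsAsFactors = FALSE)
out <- fill_bind(list(x, y))
out
```

The data.table version above follows the same steps but adds the columns by reference and binds with rbindlist, which is where the speed-up comes from.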

Note that plyr's rbind.fill is slightly faster than this particular data.table solution for list sizes up to about 500.

Benchmarking plot:

Here's the plot of runs on lists of data.frames with lengths seq(1000, 10000, by=1000). I used microbenchmark with 10 reps for each of these list lengths.

[Figure: benchmarking plot (image not available)]

Benchmarking gist:

Here's the gist for benchmarking, in case anyone wants to replicate the results.

answered Sep 28 '22 07:09

Arun
