Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

do.call(rbind, list) for uneven number of column

Tags:

I have a list, with each element being a character vector, of differing lengths I would like to bind the data as rows, so that the column names 'line up' and if there is extra data then create column and if there is missing data then create NAs

Below is a mock example of the data I am working with

x <- list()
x[[1]] <- letters[seq(2,20,by=2)]
names(x[[1]]) <- LETTERS[c(1:length(x[[1]]))]
x[[2]] <- letters[seq(3,20, by=3)]
names(x[[2]]) <- LETTERS[seq(3,20, by=3)]
x[[3]] <- letters[seq(4,20, by=4)]
names(x[[3]]) <- LETTERS[seq(4,20, by=4)]

The below line would normally be what I would do if I was sure that the format for each element was the same...

do.call(rbind,x)

I was hoping that someone had come up with a nice little solution that matches up the column names and fills in blanks with NAs whilst adding new columns if in the binding process new columns are found...

like image 967
h.l.m Avatar asked Jun 25 '13 22:06

h.l.m


People also ask

Do columns have to be in same order for Rbind?

0), rbind has the capacity to to join two data sets with the same name columns even if they are in different order.

How do I Rbind data frames with different columns in R?

Method 1 : Using plyr package rbind. fill() method in R is an enhancement of the rbind() method in base R, is used to combine data frames with different columns. The column names are number may be different in the input data frames. Missing columns of the corresponding data frames are filled with NA.

What does Rbind function do?

rbind(): The rbind or the row bind function is used to bind or combine the multiple group of rows together.

What is the difference between Rbind and Cbind?

cbind() and rbind() both create matrices by combining several vectors of the same length. cbind() combines vectors as columns, while rbind() combines them as rows. Let's use these functions to create a matrix with the numbers 1 through 30.


2 Answers

rbind.fill is an awesome function that does really well on list of data.frames. But IMHO, for this case, it could be done much faster when the list contains only (named) vectors.

The rbind.fill way

require(plyr)
rbind.fill(lapply(x,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))

A more straightforward way (and efficient for this scenario at least):

rbind.named.fill <- function(x) {
    nam <- sapply(x, names)
    unam <- unique(unlist(nam))
    len <- sapply(x, length)
    out <- vector("list", length(len))
    for (i in seq_along(len)) {
        out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
    }
    setNames(as.data.frame(do.call(rbind, out), stringsAsFactors=FALSE), unam)
}

Basically, we get total unique names to form the columns of the final data.frame. Then, we create a list with length = input and just fill the rest of the values with NA. This is probably the "trickiest" part as we've to match the names while filling NA. And then, we set names once finally to the columns (which can be set by reference using setnames from data.table package as well if need be).


Now to some benchmarking:

Data:

# generate some huge random data:
set.seed(45)
sample.fun <- function() {
    nam <- sample(LETTERS, sample(5:15))
    val <- sample(letters, length(nam))
    setNames(val, nam)  
}
ll <- replicate(1e4, sample.fun())

Functions:

# plyr's rbind.fill version:
rbind.fill.plyr <- function(x) {
    rbind.fill(lapply(x,function(y){as.data.frame(t(y),stringsAsFactors=FALSE)}))
}

rbind.named.fill <- function(x) {
    nam <- sapply(x, names)
    unam <- unique(unlist(nam))
    len <- sapply(x, length)
    out <- vector("list", length(len))
    for (i in seq_along(len)) {
        out[[i]] <- unname(x[[i]])[match(unam, nam[[i]])]
    }
    setNames(as.data.frame(do.call(rbind, out), stringsAsFactors=FALSE), unam)
}

Update (added GSee's function as well):

foo <- function (...) 
{
  dargs <- list(...)
  all.names <- unique(names(unlist(dargs)))
  out <- do.call(rbind, lapply(dargs, `[`, all.names))
  colnames(out) <- all.names
  as.data.frame(out, stringsAsFactors=FALSE)
}

Benchmarking:

require(microbenchmark)
microbenchmark(t1 <- rbind.named.fill(ll), 
               t2 <- rbind.fill.plyr(ll), 
               t3 <- do.call(foo, ll), times=10)
identical(t1, t2) # TRUE
identical(t1, t3) # TRUE

Unit: milliseconds
                       expr        min         lq     median         uq        max neval
 t1 <- rbind.named.fill(ll)   243.0754   258.4653   307.2575   359.4332   385.6287    10
  t2 <- rbind.fill.plyr(ll) 16808.3334 17139.3068 17648.1882 17890.9384 18220.2534    10
     t3 <- do.call(foo, ll)   188.5139   204.2514   229.0074   339.6309   359.4995    10
like image 191
Arun Avatar answered Sep 20 '22 11:09

Arun


If you want the result to be a matrix...

I recently wrote this function for a co-worker that wanted to rbind vectors into a matrix.

foo <- function (...) 
{
  dargs <- list(...)
  if (!all(vapply(dargs, is.vector, TRUE))) 
      stop("all inputs must be vectors")
  if (!all(vapply(dargs, function(x) !is.null(names(x)), TRUE))) 
      stop("all input vectors must be named.")
  all.names <- unique(names(unlist(dargs)))
  out <- do.call(rbind, lapply(dargs, `[`, all.names))
  colnames(out) <- all.names
  out
}

R > do.call(foo, x)
     A   B   C   D   E   F   G   H   I   J   L   O   R   P   T  
[1,] "b" "d" "f" "h" "j" "l" "n" "p" "r" "t" NA  NA  NA  NA  NA 
[2,] NA  NA  "c" NA  NA  "f" NA  NA  "i" NA  "l" "o" "r" NA  NA 
[3,] NA  NA  NA  "d" NA  NA  NA  "h" NA  NA  "l" NA  NA  "p" "t"
like image 20
GSee Avatar answered Sep 17 '22 11:09

GSee