
Convert Mixed-Length named List to data.frame

Tags: dataframe, r
I have a list of the following format:

[[1]]
[[1]]$a
[1] 1

[[1]]$b
[1] 3

[[1]]$c
[1] 5

[[2]]       
[[2]]$c
[1] 2

[[2]]$a
[1] 3

There is a predefined list of possible "keys" (a, b, and c, in this case) and each element in the list ("row") will have values defined for one or more of these keys. I'm looking for a fast way to get from the list structure above to a data.frame which would look like the following, in this case:

  a  b c
1 1  3 5
2 3 NA 2

Any help would be appreciated!
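
For anyone who wants to reproduce this, here is a small sketch of the input and the desired output. It assumes the values are plain numerics; `x` and `target` are names chosen here purely for illustration.

```r
# The list printed above, assuming plain numeric values
x <- list(
    list(a = 1, b = 3, c = 5),
    list(c = 2, a = 3)
)

# The desired result, built by hand for comparison
target <- data.frame(a = c(1, 3), b = c(3, NA), c = c(5, 2))
```

Note that the second element lists its keys in a different order (c before a) and has no b at all, so a solution has to match values by name, not by position.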


Appendix

I'm dealing with a table that will have up to 50,000 rows and 3-6 columns, with most of the values specified. I'll be taking the table in from JSON and trying to quickly get it into data.frame structure.

Here's some code to create a sample list of the scale with which I'll be working:

ids <- c("a", "b", "c")
createList <- function(approxSize = 100) {
    set.seed(1234)

    # number of repetitions of the five-row pattern below
    fifth <- round(approxSize / 5)

    rows <- list()
    rows[1:(fifth * 5)] <- rep(
        list(list(a = 1, b = 2, c = 3),
             list(a = 3, b = 4, c = 5),
             list(a = 7, c = 9),
             list(c = 6, a = 8, b = 3),
             list(b = 6)),
        fifth)

    rows
}

Call createList(50000) to test performance on a list of this size.

Jeff Allen asked Apr 01 '13 22:04


4 Answers

Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:

## sapply() version
# makeDF <- function(List, Names) {
#     m <- t(sapply(List, function(X) unlist(X)[Names]))
#     as.data.frame(m)
# }

## vapply() is a bit faster than sapply()
makeDF <- function(List, Names) {
    m <- t(vapply(List, 
                  FUN = function(X) unlist(X)[Names], 
                  FUN.VALUE = numeric(length(Names))))
    as.data.frame(m)
}

## Test timing with a 50k-item list
ll <- createList(50000)
nms <- c("a", "b", "c")

system.time(makeDF(ll, nms))
# user  system elapsed 
# 0.47    0.00    0.47 
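
One caveat, offered as a hedged sketch and not a replacement: as far as I can tell, vapply() takes its dimnames from the first result, so if the first list element happens to be missing a key, the column names come out wrong. The variant below sets the column names explicitly. `makeDF2` is a name made up here, and the code assumes every value is a single numeric and no element is empty.

```r
# Variant of makeDF() that does not rely on the first element for names
makeDF2 <- function(List, Names) {
    # one column per list element, one row per key; missing keys become NA
    vals <- vapply(List,
                   FUN = function(X) unname(unlist(X)[Names]),
                   FUN.VALUE = numeric(length(Names)))
    # refill row by row so each list element becomes one row
    m <- matrix(vals, ncol = length(Names), byrow = TRUE,
                dimnames = list(NULL, Names))
    as.data.frame(m)
}

# The first element is deliberately missing "a" and "c"
x <- list(list(b = 6), list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
df <- makeDF2(x, c("a", "b", "c"))
```

Building the matrix with matrix(..., byrow = TRUE) in place of t() also keeps the shape right when there is only one key, where vapply() returns a plain vector.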

Josh O'Brien answered Sep 21 '22 12:09


Here is a short answer, though I doubt it will be very fast.

> library(plyr)
> # x is the list from the question
> rbind.fill(lapply(x, as.data.frame))
  a  b c
1 1  3 5
2 3 NA 2
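
To make the "fill" step concrete, here is a rough base R sketch of the same idea: pad each row with NA for the keys it lacks, reorder, then bind. This is not how plyr implements rbind.fill; `keys`, `rows` and `df` are illustrative names, and the values are assumed to be numeric.

```r
x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
keys <- c("a", "b", "c")

rows <- lapply(x, function(row) {
    row[setdiff(keys, names(row))] <- NA_real_  # add any missing keys as NA
    as.data.frame(row[keys])                    # put the columns in key order
})
df <- do.call(rbind, rows)
```

Like the plyr version, this creates one small data.frame per row, so it is likely to be slow on a 50,000-element list.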

flodel answered Sep 17 '22 12:09


If you know the possible keys beforehand and are dealing with large data, using data.table with set() may be fast:

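library(data.table)

ids <- c("a", "b", "c")   # the predefined keys, as in the question
cc <- createList(50000)

system.time({
  # pre-allocate one all-NA numeric column per key
  nas <- rep.int(NA_real_, length(cc))
  DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)

  # fill in the cells that are present, by reference
  for (xx in seq_along(cc)) {
    .n <- names(cc[[xx]])
    for (j in .n) {
      set(DT, i = xx, j = j, value = cc[[xx]][[j]])
    }
  }
})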


# user  system elapsed 
# 0.68    0.01    0.70 
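
The same pre-allocate-then-fill idea can also be sketched without a loop. Note that this swaps data.table's set() for plain base R matrix indexing, so it is a different technique, not a restatement of the code above. It assumes every value is a single numeric, every name is one of the known keys, and R >= 3.2.0 for lengths(); the variable names are illustrative.

```r
x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
keys <- c("a", "b", "c")

vals <- unlist(x)                       # all values, named by their key
rows <- rep(seq_along(x), lengths(x))   # row index of each value
cols <- match(names(vals), keys)        # column index of each value

# pre-allocate an all-NA matrix, then fill every known cell in one assignment
m <- matrix(NA_real_, nrow = length(x), ncol = length(keys),
            dimnames = list(NULL, keys))
m[cbind(rows, cols)] <- vals
df <- as.data.frame(m)
```

Indexing a matrix with a two-column matrix of (row, column) pairs assigns all cells at once, which avoids one function call per cell.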

Old (slow) solution, kept for posterity:

full <- c('a', 'b', 'c')

system.time({
  for (xx in seq_along(cc)) {
    # keys missing from this row
    mm <- setdiff(full, names(cc[[xx]]))
    # only rows with missing or out-of-order keys need fixing up
    if (length(mm) || !all(names(cc[[xx]]) == full)) {
      cc[[xx]] <- as.data.table(cc[[xx]])
      if (length(mm)) {
        # add the missing columns, filled with NA
        cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
      }
      # put columns in the correct order
      setcolorder(cc[[xx]], full)
    }
  }

  cdt <- rbindlist(cc)
})

#   user  system elapsed 
# 21.83    0.06   22.00 

This second solution has been left here to show how data.table can be used poorly.

mnel answered Sep 17 '22 12:09


I know this is an old question, but I just came across it and it's excruciating not to see the simplest solution I'm aware of. So here it is (simply specify 'fill=TRUE' in rbindlist):

library(data.table)
x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
rbindlist(x, fill = TRUE)

#    a  b c
# 1: 1  3 5
# 2: 3 NA 2

I don't know if this is the fastest way, but I'd be willing to bet that it competes, given data.table's thoughtful design and extremely good performance on a lot of other tasks.
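
Since the question asks for a data.frame and rbindlist() returns a data.table, a conversion step may be wanted. The sketch below assumes every predefined key shows up in at least one element (otherwise setcolorder() would fail); `keys`, `dt` and `df` are illustrative names.

```r
library(data.table)

x <- list(list(a = 1, b = 3, c = 5), list(c = 2, a = 3))
keys <- c("a", "b", "c")   # hypothetical name for the predefined keys

dt <- rbindlist(x, fill = TRUE)
setcolorder(dt, keys)      # make the column order explicit
df <- as.data.frame(dt)    # or setDF(dt) to convert in place without a copy
```

as.data.frame() makes a copy, which should be negligible at 50,000 rows; setDF() avoids it by converting by reference.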

Matt Hawthorn answered Sep 20 '22 12:09