I have a list of the following format:
[[1]]
[[1]]$a
[1] 1

[[1]]$b
[1] 3

[[1]]$c
[1] 5


[[2]]
[[2]]$c
[1] 2

[[2]]$a
[1] 3
There is a predefined list of possible "keys" (a, b, and c, in this case), and each element in the list (a "row") will have values defined for one or more of these keys. I'm looking for a fast way to get from the list structure above to a data.frame, which in this case would look like the following:
a b c
1 1 3 5
2 3 NA 2
Any help would be appreciated!
Appendix
I'm dealing with a table that will have up to 50,000 rows and 3-6 columns, with most of the values specified. I'll be taking the table in from JSON and trying to quickly get it into data.frame structure.
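As an aside on the JSON step: if the input is a JSON array of objects, the jsonlite package (an assumption, since the question doesn't name a parser) will simplify it straight to a data.frame, filling missing keys with NA:
library(jsonlite)
json <- '[{"a":1,"b":3,"c":5},{"c":2,"a":3}]'
fromJSON(json)
#   a  b c
# 1 1  3 5
# 2 3 NA 2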
Here's some code to create a sample list of the scale with which I'll be working:
ids <- c("a", "b", "c")
createList <- function(approxSize=100){
  set.seed(1234)
  fifth <- round(approxSize/5)
  out <- list()
  ## Recycle five template "rows" (some with keys missing) until the
  ## list reaches approximately the requested size
  out[1:(fifth*5)] <- rep(
    list(list(a=1, b=2, c=3),
         list(a=3, b=4, c=5),
         list(a=7, c=9),
         list(c=6, a=8, b=3),
         list(b=6)),
    fifth)
  out
}
Just create a list with approxSize = 50000 to test the performance on a list of this size.
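As a quick sanity check on the generator (a small sketch; the default size is fine for this):
ll <- createList()   # default approxSize = 100
length(ll)           # 100
str(ll[[3]])         # a "row" with only a and c defined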
Here's my initial thought. It doesn't speed up your approach, but it does simplify the code considerably:
# makeDF <- function(List, Names) {
#   m <- t(sapply(List, function(X) unlist(X)[Names]))
#   as.data.frame(m)
# }
## vapply() is a bit faster than sapply()
makeDF <- function(List, Names) {
  m <- t(vapply(List,
                FUN = function(X) unlist(X)[Names],
                FUN.VALUE = numeric(length(Names))))
  as.data.frame(m)
}
## Test timing with a 50k-item list
ll <- createList(50000)
nms <- c("a", "b", "c")
system.time(makeDF(ll, nms))
# user system elapsed
# 0.47 0.00 0.47
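As a sanity check on the two-row example from the question, makeDF() reproduces the expected data.frame, with NA wherever a key is missing:
makeDF(list(list(a=1, b=3, c=5), list(c=2, a=3)), nms)
#   a  b c
# 1 1  3 5
# 2 3 NA 2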
Here is a short answer, though I doubt it will be very fast.
> library(plyr)
> x <- list(list(a=1, b=3, c=5), list(c=2, a=3))
> rbind.fill(lapply(x, as.data.frame))
a b c
1 1 3 5
2 3 NA 2
If you know the possible keys beforehand and you are dealing with large data, perhaps using data.table and set will be fast:
library(data.table)
cc <- createList(50000)
system.time({
  ## Pre-allocate an all-NA data.table with one column per key
  nas <- rep.int(NA_real_, length(cc))
  DT <- setnames(as.data.table(replicate(length(ids), nas, simplify = FALSE)), ids)
  ## Fill in each row by reference, one defined key at a time
  for(xx in seq_along(cc)){
    .n <- names(cc[[xx]])
    for(j in .n){
      set(DT, i = xx, j = j, value = cc[[xx]][[j]])
    }
  }
})
# user system elapsed
# 0.68 0.01 0.70
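Since createList() cycles through five fixed templates, the first five rows of DT are known in advance, which makes for an easy correctness check:
head(DT, 5)
#     a  b  c
# 1:  1  2  3
# 2:  3  4  5
# 3:  7 NA  9
# 4:  8  3  6
# 5: NA  6 NA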
full <- c('a', 'b', 'c')
system.time({
  for(xx in seq_along(cc)) {
    mm <- setdiff(full, names(cc[[xx]]))
    ## convert any element whose names are incomplete or out of order
    if(length(mm) || !all(names(cc[[xx]]) == full)){
      cc[[xx]] <- as.data.table(cc[[xx]])
      # if required, add the missing columns
      if(length(mm)){
        cc[[xx]][, (mm) := as.list(rep(NA_real_, length(mm)))]
      }
      # put columns in the correct order
      setcolorder(cc[[xx]], full)
    }
  }
  cdt <- rbindlist(cc)
})
# user system elapsed
# 21.83 0.06 22.00
This second solution has been left here to show how data.table can be used poorly.
I know this is an old question, but I just came across it and it's excruciating not to see the simplest solution I'm aware of. So here it is (simply specify 'fill=TRUE' in rbindlist):
library(data.table)
ll <- list(list(a=1, b=3, c=5), list(c=2, a=3))
rbindlist(ll, fill=TRUE)
# a b c
# 1: 1 3 5
# 2: 3 NA 2
I don't know if this is the fastest way, but I'd be willing to bet that it competes, given data.table's thoughtful design and extremely good performance on a lot of other tasks.
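For what it's worth, the timing harness from the question's appendix applies directly; a sketch (no numbers claimed, since they vary by machine):
cc <- createList(50000)
system.time(rbindlist(cc, fill = TRUE))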