I am dealing with a collection of lists that contain deeply nested lists with no fixed structure, other than the fact that each entry has a top-level variations element.
For example:
list(
  list(variations = list(
    '12' = list(x = c(a = 1))
  )),
  list(variations = list(
    '3' = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)
I need to convert the list data structure into a melted (per reshape) dataframe of the form
data.frame(
variation = c( '12', '3', '3', 'abcd', 'abcd'),
variable = c('x.a', 'x.a', 'x.b', 'x.b', 'm.n.o.p'),
value = c( 1, 6, 4, 1, 1023)
)
or another data structure I can perform fast grouping and filtering on.
There are many millions of nodes in the data structure. The collection can have thousands of entries and each entry has tens of thousands of variations with 2-10+ leaf nodes with unknown names.
I am looking for suggestions on how to build the dataframe from the collection in a fast way.
One approach would be to use unlist on the source data to flatten the lists, but I am not sure about the following:
Should I run unlist on the whole data structure, which will convert the leaf numeric nodes to strings (which I will then need to parse back into numerics), or should I use unlist on each variation (which will leave the numeric leaf nodes intact)?
What's a good way to parse the long names that unlist will create to extract variation and variable values without generating too many intermediate values?
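To make the per-variation option concrete, here is a minimal sketch using the sample data above. The names collection, variations, leaves and melted are made up for illustration, and this has not been benchmarked on millions of nodes. It relies on unlist building dotted path names (such as m.n.o.p) on its own, so no name parsing is needed when each variation is flattened separately.

```r
# Sample data from the question; `collection` is an illustrative name.
collection <- list(
  list(variations = list(
    '12' = list(x = c(a = 1))
  )),
  list(variations = list(
    '3' = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Flatten one level so each element is a single variation, keyed by its name.
variations <- unlist(lapply(collection, `[[`, "variations"), recursive = FALSE)

# unlist() each variation on its own: the result is a named numeric vector
# whose names are the dotted paths to the leaves (e.g. "m.n.o.p").
leaves <- lapply(variations, unlist)

# Build the three columns as whole vectors, then create the data frame once.
melted <- data.frame(
  variation = rep(names(leaves), lengths(leaves)),
  variable  = unlist(lapply(leaves, names), use.names = FALSE),
  value     = unlist(leaves, use.names = FALSE),
  stringsAsFactors = FALSE
)
```

Because the variation name comes from names(leaves) rather than from parsing a long string, variation names that contain dots should not cause trouble in this version.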
Regardless of whether unlist is the right way to go, I'm wondering:
Is it better to build separate variation, variable and value vectors (or a matrix) and then combine them into a dataframe, as opposed to building the dataframe row-by-row?
Should I not be using dataframes but another, faster, data structure for dealing with this type of data? Whatever I end up using needs to be convertible to dataframes for use with plyr, reshape and ggplot.
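For the vectors-versus-rows question, the two styles can be sketched side by side. The records list below is a hypothetical toy input, not the real data, and the names by_row and by_col are made up. The general expectation is that growing a data frame with rbind copies what has been built so far on every iteration, while the column-wise version allocates each vector once, but that is worth timing on real data before relying on it.

```r
# Hypothetical toy input: three (variation, variable, value) records.
records <- list(
  list("12", "x.a", 1),
  list("3",  "x.a", 6),
  list("3",  "x.b", 4)
)

# Row-by-row: every rbind() call produces a new, longer data frame.
by_row <- data.frame(variation = character(0), variable = character(0),
                     value = numeric(0), stringsAsFactors = FALSE)
for (r in records) {
  by_row <- rbind(by_row, data.frame(variation = r[[1]], variable = r[[2]],
                                     value = r[[3]], stringsAsFactors = FALSE))
}

# Column-wise: build each vector once, then call data.frame() a single time.
by_col <- data.frame(
  variation = vapply(records, function(r) r[[1]], character(1)),
  variable  = vapply(records, function(r) r[[2]], character(1)),
  value     = vapply(records, function(r) r[[3]], numeric(1)),
  stringsAsFactors = FALSE
)
```

Both versions end up with the same columns; only the amount of intermediate copying differs.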
There's a function that doesn't seem to get used much called rapply, which recursively operates on lists. I have no idea how fast it is (it's based on lapply, so probably not terrible but not amazing), and it's tricky to use. But it's worth considering, if only for elegance.
Here's one basic example of its use:
> rapply( test, classes="numeric", how="unlist", f=function(var) data.frame(names(var),var) )
variations.12.x.names.var. variations.12.x.var variations.3.x.names.var.1 variations.3.x.names.var.2 variations.3.x.var1
"a" "1" "a" "b" "6"
variations.3.x.var2 variations.abcd.x.names.var. variations.abcd.x.var variations.abcd.m.n.o.names.var. variations.abcd.m.n.o.var
"4" "b" "1" "p" "1023"
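A variation on the call above, as a sketch only: if f simply returns each numeric leaf unchanged, rapply with how="unlist" gives one named numeric vector, so the values stay numeric and the names can be split into variation and variable with two sub calls. Here test is assumed to be the sample list from the question, and the regular expressions assume that variation names contain no dots; flat, nm and melted are made-up names.

```r
# Assumed definition of `test`: the sample list from the question.
test <- list(
  list(variations = list(
    '12' = list(x = c(a = 1))
  )),
  list(variations = list(
    '3' = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Return every numeric leaf as-is; how = "unlist" flattens the result into a
# single named numeric vector with names like "variations.3.x.b".
flat <- rapply(test, f = function(v) v, classes = "numeric", how = "unlist")
nm <- names(flat)

# Split each name into the variation (first component after "variations.")
# and the variable (everything after that). Assumes no dots in variation names.
melted <- data.frame(
  variation = sub("^variations\\.([^.]+)\\..*$", "\\1", nm),
  variable  = sub("^variations\\.[^.]+\\.", "", nm),
  value     = unname(flat),
  stringsAsFactors = FALSE
)
```

This keeps the number of intermediate objects small (one flat vector and its names), at the cost of depending on the dotted name format.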