High-performance big data manipulation in R

Question

I am dealing with a collection of lists that contain deeply nested lists with no fixed structure, except that:

The lists at level 1 have a single element called variations.
All leaf data in the hierarchy is numeric.

For example:

list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
    )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
    ))
  )

I need to convert the list data structure into a melted (per reshape) dataframe of the form

data.frame(
  variation = c( '12',   '3',   '3', 'abcd',    'abcd'),
  variable  = c('x.a', 'x.a', 'x.b',  'x.b', 'm.n.o.p'),
  value     = c(    1,     6,     4,      1,      1023)
  )

or another data structure I can perform fast grouping and filtering on.

There are many millions of nodes in the data structure. The collection can have thousands of entries and each entry has tens of thousands of variations with 2-10+ leaf nodes with unknown names.

I am looking for suggestions on how to build the dataframe from the collection in a fast way.

One approach would be to use unlist on the source data to flatten the lists but I am not sure about the following:

Should I run unlist on the whole data structure, which will convert the leaf numeric nodes to strings (which I will then need to parse back into numerics) or should I use unlist on each variation (which will leave the numeric leaf nodes intact)?
What's a good way to parse the long names that unlist will create to extract variation and variable values without generating too many intermediate values?
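On the first point, a quick check may help: when every leaf is numeric, unlist on the whole structure returns a named numeric vector rather than strings, so the values should not need re-parsing. The sketch below flattens the example data once and splits the generated names with two regular expressions. It assumes variation names never contain a dot, and the names test, flat, nm and melted are just illustrative.

```r
# Example data from the question, assigned to `test` (name chosen for illustration).
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# One pass over the whole structure; leaves stay numeric because all are numeric.
flat <- unlist(test)

# Names look like "variations.<variation>.<path to leaf>".
nm <- sub("^variations\\.", "", names(flat))

# ASSUMPTION: variation names contain no ".", so the first dot is the separator.
melted <- data.frame(
  variation = sub("\\..*$", "", nm),     # text before the first dot
  variable  = sub("^[^.]*\\.", "", nm),  # text after the first dot
  value     = unname(flat),
  stringsAsFactors = FALSE
)
```

Both sub calls are vectorised over the full name vector, so the only intermediates are a couple of character vectors of the same length as the data.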

Regardless of whether unlist is the right way to go, I'm wondering:

Is it better to build separate variation, variable and value vectors (or a matrix) and then combine them into a dataframe, rather than building the dataframe row by row?
Should I not be using dataframes but another, faster, data structure for dealing with this type of data? Whatever I end up using needs to be convertible to dataframes for use with plyr, reshape and ggplot.
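To make the "separate vectors" option concrete, here is a minimal sketch that calls unlist once per variation, builds the three columns as whole vectors, and creates one dataframe per entry instead of one per row. The helper melt_entry and the name test are hypothetical, and the sketch assumes each entry has exactly the variations element described above. No timing claims are made here; it would need benchmarking on the real data.

```r
# Example data from the question, assigned to `test` (name chosen for illustration).
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Hypothetical helper: melt one top-level entry into a three-column dataframe.
melt_entry <- function(entry) {
  vars   <- entry$variations
  leaves <- lapply(vars, unlist)  # one named numeric vector per variation
  data.frame(
    # Repeat each variation name once per leaf it contains.
    variation = rep(names(vars), lengths(leaves)),
    # Leaf paths such as "x.a" or "m.n.o.p", taken from the unlist names.
    variable  = unlist(lapply(leaves, names), use.names = FALSE),
    value     = unlist(leaves, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

# Combine the per-entry dataframes in a single step.
melted <- do.call(rbind, lapply(test, melt_entry))
```

Because the variation name never passes through the generated names, this variant does not need any name parsing and would also cope with variation names that contain dots.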

Ari B. Friedman · Accepted Answer

There's a rarely used function called rapply, which applies a function recursively to the leaves of a list. I have no idea how fast it is (it's based on lapply, so probably not terrible but not amazing), and it's tricky to use. Still, it's worth considering, if only for elegance.

Here's one basic example of its use, with the example list from the question assigned to test:

> rapply( test, classes="numeric", how="unlist", f=function(var) data.frame(names(var),var) )
      variations.12.x.names.var.              variations.12.x.var       variations.3.x.names.var.1       variations.3.x.names.var.2              variations.3.x.var1 
                             "a"                              "1"                              "a"                              "b"                              "6" 
             variations.3.x.var2     variations.abcd.x.names.var.            variations.abcd.x.var variations.abcd.m.n.o.names.var.        variations.abcd.m.n.o.var 
                             "4"                              "b"                              "1"                              "p"                           "1023"
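Note that every value in the output above is a character string, as the quotes show, presumably because f returns the names together with the values. A variant that may be closer to what the question needs is to let f return each leaf unchanged, which should keep the result numeric and leave the leaf path in the names. This is only a sketch: the identity function and the name flat are my own choices, and test is again the example list from the question.

```r
# Example data from the question, assigned to `test`.
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Visit every numeric leaf and return it as is; how = "unlist" flattens the result
# into one named vector, with names built from the path to each leaf.
flat <- rapply(test, f = function(v) v, classes = "numeric", how = "unlist")

is.numeric(flat)  # values are not coerced to character
names(flat)       # e.g. "variations.12.x.a"
```

With an identity f this is equivalent to a plain unlist on all-numeric data; rapply becomes more useful when f has to transform or filter the leaves, for example by class.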

Tags: performance, dataframe, r, bigdata, reshape2
