
High-performance big data manipulation in R

I am dealing with a collection of lists, which contain deeply nested lists with no fixed structure other than the fact that:

  1. The lists at level 1 have a single element called variations
  2. All leaf data in the hierarchy is numeric.

For example:

list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
    )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
    ))
  )

I need to convert the list data structure into a melted (per reshape) dataframe of the form

data.frame(
  variation = c( '12',   '3',   '3', 'abcd',    'abcd'),
  variable  = c('x.a', 'x.a', 'x.b',  'x.b', 'm.n.o.p'),
  value     = c(    1,     6,     4,      1,      1023)
  )

or another data structure I can perform fast grouping and filtering on.

There are many millions of nodes in the data structure. The collection can have thousands of entries and each entry has tens of thousands of variations with 2-10+ leaf nodes with unknown names.

I am looking for suggestions on how to build the dataframe from the collection in a fast way.

One approach would be to use unlist on the source data to flatten the lists but I am not sure about the following:

  • Should I run unlist on the whole data structure, which will convert the leaf numeric nodes to strings (which I will then need to parse back into numerics) or should I use unlist on each variation (which will leave the numeric leaf nodes intact)?

  • What's a good way to parse the long names that unlist will create to extract variation and variable values without generating too many intermediate values?
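As a minimal sketch of the per-entry option, the code below calls unlist on each entry's variations and splits the generated names at the first dot. Because every leaf is numeric, unlist returns a numeric vector here, so nothing needs to be parsed back from strings. The helper name flatten_entry is hypothetical, and the split assumes that variation names never contain a dot.

```r
# The example collection from the question.
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Hypothetical helper: flatten one entry into three parallel columns.
# unlist() joins the names along each path with ".", e.g. "abcd.m.n.o.p".
flatten_entry <- function(entry) {
  flat <- unlist(entry$variations)
  nm   <- names(flat)
  data.frame(
    # Everything before the first "." is the variation
    # (assumes variation names contain no ".").
    variation = sub("\\..*$", "", nm),
    # Everything after the first "." is the variable path.
    variable  = sub("^[^.]*\\.", "", nm),
    value     = unname(flat),
    stringsAsFactors = FALSE
  )
}

# One data frame per entry, combined in a single rbind call.
melted <- do.call(rbind, lapply(test, flatten_entry))
```

Both sub calls are vectorised over the whole name vector, so the only intermediates are one character vector per column.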

Regardless of whether unlist is the right way to go, I'm wondering:

  • Is it better to build separate variation, variable and value vectors (or a matrix) and then combine them into a dataframe, rather than building the dataframe row by row?

  • Should I not be using dataframes but another, faster, data structure for dealing with this type of data? Whatever I end up using needs to be convertible to dataframes for use with plyr, reshape and ggplot.
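On the first point, a small sketch of the "separate vectors, then combine once" option is shown below. Growing a dataframe with rbind inside a loop typically copies the accumulated rows at every step, so collecting the columns first and calling data.frame once is usually the cheaper pattern. The pieces list is hypothetical stand-in data, not output from the real collection.

```r
# Hypothetical per-entry results: each piece holds three equal-length vectors.
pieces <- list(
  list(variation = "12",          variable = "x.a",           value = 1),
  list(variation = c("3", "3"),   variable = c("x.a", "x.b"), value = c(6, 4))
)

# Concatenate each column across all pieces, then build the data frame once.
melted <- data.frame(
  variation = unlist(lapply(pieces, `[[`, "variation")),
  variable  = unlist(lapply(pieces, `[[`, "variable")),
  value     = unlist(lapply(pieces, `[[`, "value")),
  stringsAsFactors = FALSE
)
```

The result is an ordinary dataframe, so it can be passed to plyr, reshape and ggplot unchanged.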

Sim asked Oct 22 '22 21:10


1 Answer

There's a function that doesn't seem to get used much called rapply which recursively operates on lists. I have no idea how fast it is (based on lapply, so probably not terrible but not amazing), and it's tricky to use. But worth considering, if only for elegance.
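As a rough sketch of the idea, assuming test holds the example list from the question: passing a function that returns each leaf unchanged, together with how = "unlist", yields one named numeric vector for the whole collection. The names carry a leading "variations." prefix, which is stripped before splitting at the first dot. The split again assumes that variation names contain no dot.

```r
# The example collection from the question.
test <- list(
  list(variations = list(
    '12'   = list(x = c(a = 1))
  )),
  list(variations = list(
    '3'    = list(x = c(a = 6, b = 4)),
    'abcd' = list(x = c(b = 1), m = list(n = list(o = c(p = 1023))))
  ))
)

# Visit every numeric leaf, return it unchanged, and flatten the result.
# Names look like "variations.abcd.m.n.o.p"; values stay numeric.
flat <- rapply(test, function(x) x, classes = "numeric", how = "unlist")

# Drop the constant "variations." prefix, then split at the first ".".
nm <- sub("^variations\\.", "", names(flat))
melted <- data.frame(
  variation = sub("\\..*$", "", nm),      # part before the first "."
  variable  = sub("^[^.]*\\.", "", nm),   # part after the first "."
  value     = unname(flat),
  stringsAsFactors = FALSE
)
```

Whether this is faster than calling unlist directly would need to be measured on the real data.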

Here's one basic example of its use, where test is the example list from the question:

> rapply( test, classes="numeric", how="unlist", f=function(var) data.frame(names(var),var) )
      variations.12.x.names.var.              variations.12.x.var       variations.3.x.names.var.1       variations.3.x.names.var.2              variations.3.x.var1 
                             "a"                              "1"                              "a"                              "b"                              "6" 
             variations.3.x.var2     variations.abcd.x.names.var.            variations.abcd.x.var variations.abcd.m.n.o.names.var.        variations.abcd.m.n.o.var 
                             "4"                              "b"                              "1"                              "p"                           "1023" 
Ari B. Friedman answered Oct 27 '22 10:10