 

Trouble converting long list of data.frames (~1 million) to single data.frame using do.call and ldply

I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million data.frames of the same structure, same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.

The data started out as a JSON file, which I parsed into lists using fromJSON, then ran another lapply to extract part of each list element and convert it to a data.frame, ending up with a list of data.frames.
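For concreteness, the pipeline described above might look something like this (a sketch only; the file name, the `fields` element extracted from each record, and the `df_list` name are assumptions, not taken from the question):

```r
library(rjson)

# Parse the JSON file into a list of records (file name is hypothetical)
records <- fromJSON(file = "data.json")

# Extract the part of interest from each record and convert it to a
# one-row data.frame, yielding a long list of small data.frames
df_list <- lapply(records, function(rec) {
  as.data.frame(rec$fields, stringsAsFactors = FALSE)
})
```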

I've tried:

df <- do.call("rbind", list)
df <- ldply(list)

but I've had to kill the process after letting it run up to 3 hours and not getting anything back.

Is there a more efficient method of doing this? How can I troubleshoot what is happening and why it is taking so long?

FYI - I'm using RStudio Server on a 72GB quad-core server running RHEL, so I don't think memory is the problem. sessionInfo() below:

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multicore_0.1-7 plyr_1.7.1      rjson_0.2.6

loaded via a namespace (and not attached):
[1] tools_2.14.1
asked Mar 15 '12 by wahalulu
1 Answer

Given that you are looking for performance, a data.table solution should be suggested.

There is a function, rbindlist, which does the same job as do.call(rbind, list) but is much faster.

library(data.table)
X <- replicate(50000, data.table(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed
##  0.00    0.01    0.02

It is also very fast for a list of data.frames:

Xdf <- replicate(50000, data.frame(a=rnorm(5), b=1:5), simplify=FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed
##  0.03    0.00    0.03

For comparison:

system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed
## 50.72    9.89   60.88
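Much of that cost comes from rbind.data.frame, which re-checks and matches column names and types for every one of the inputs and copies the result repeatedly. When every data.frame is known to have an identical structure, a base-R workaround is to assemble each column directly with unlist (a sketch under that assumption; the function name cbind_cols is hypothetical):

```r
# Assumes every data.frame in dfs has identical column names and types.
# Builds each output column in one pass instead of rbind-ing row by row.
cbind_cols <- function(dfs) {
  cols <- names(dfs[[1]])
  out <- lapply(cols, function(cn) {
    unlist(lapply(dfs, `[[`, cn), use.names = FALSE)
  })
  names(out) <- cols
  as.data.frame(out, stringsAsFactors = FALSE)
}
```

This avoids the per-element dispatch and name matching, which is essentially what rbindlist does in C.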

And some proper benchmarking:

library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
          rbindlist.data.frame = rbindlist(Xdf),
          docall = do.call(rbind, Xdf),
          replications = 5)
##                   test replications elapsed    relative user.self sys.self
## 3               docall            5  276.61 3073.444445    264.08     11.4
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0

And against @JoshuaUlrich's solutions:

benchmark(use.rbl.dt    = rbl.dt(X),
          use.rbl.ju    = rbl.ju(Xdf),
          use.rbindlist = rbindlist(X),
          replications = 5)
##              test replications elapsed relative user.self
## 3  use.rbindlist            5    0.10      1.0      0.09
## 1     use.rbl.dt            5    0.10      1.0      0.09
## 2     use.rbl.ju            5    0.33      3.3      0.31

I'm not sure you really need to convert back with as.data.frame, because a data.table inherits class data.frame.
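A quick check of that inheritance (assuming the data.table package is loaded):

```r
library(data.table)

dt <- rbindlist(list(data.table(a = 1), data.table(a = 2)))
class(dt)                   # "data.table" "data.frame"
inherits(dt, "data.frame")  # TRUE

# Only if a plain data.frame is truly required:
df <- as.data.frame(dt)
```

Because of this inheritance, most functions expecting a data.frame will accept the rbindlist result as-is.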

answered Sep 28 '22 by mnel