I know there are many questions here on SO about ways to convert a list of data.frames to a single data.frame using do.call or ldply, but this question is about understanding the inner workings of both methods and trying to figure out why I can't get either to work for concatenating a list of almost 1 million data.frames of the same structure, same field names, etc. into a single data.frame. Each data.frame has one row and 21 columns.
The data started out as a JSON file, which I converted to a list with fromJSON. I then ran another lapply to extract the part of each list element I needed and convert it to a data.frame, which left me with a list of data.frames.
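Roughly, the preprocessing looked like this (a minimal sketch; the file name and the "details" element are placeholders, not my actual field names):

library(rjson)

## Hypothetical structure: one JSON record per list element,
## with the fields of interest nested under "details".
parsed <- fromJSON(file = "data.json")
df_list <- lapply(parsed, function(x) {
  as.data.frame(x$details, stringsAsFactors = FALSE)  # one-row data.frame
})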
I've tried:
df <- do.call("rbind", list)
df <- ldply(list)
but I've had to kill the process after letting it run for up to 3 hours without getting anything back.
Is there a more efficient method of doing this? How can I troubleshoot what is happening and why it is taking so long?
FYI - I'm using RStudio Server on a 72GB quad-core server running RHEL, so I don't think memory is the problem. sessionInfo is below:
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] multicore_0.1-7 plyr_1.7.1      rjson_0.2.6

loaded via a namespace (and not attached):
[1] tools_2.14.1
Given that you are looking for performance, a data.table solution is worth suggesting. data.table provides the function rbindlist, which does the same thing as do.call(rbind, list) but is much faster.
library(data.table)
X <- replicate(50000, data.table(a = rnorm(5), b = 1:5), simplify = FALSE)
system.time(rbindlist.data.table <- rbindlist(X))
##  user  system elapsed
##  0.00    0.01    0.02
It is also very fast for a list of data.frames:
Xdf <- replicate(50000, data.frame(a = rnorm(5), b = 1:5), simplify = FALSE)
system.time(rbindlist.data.frame <- rbindlist(Xdf))
##  user  system elapsed
##  0.03    0.00    0.03
For comparison:
system.time(docall <- do.call(rbind, Xdf))
##  user  system elapsed
## 50.72    9.89   60.88
And some proper benchmarking:
library(rbenchmark)
benchmark(rbindlist.data.table = rbindlist(X),
          rbindlist.data.frame = rbindlist(Xdf),
          docall               = do.call(rbind, Xdf),
          replications = 5)
##                   test replications elapsed    relative user.self sys.self
## 3               docall            5  276.61 3073.444445    264.08     11.4
## 2 rbindlist.data.frame            5    0.11    1.222222      0.11      0.0
## 1 rbindlist.data.table            5    0.09    1.000000      0.09      0.0
And comparing against the rbl.dt and rbl.ju approaches proposed in the other answers:

benchmark(use.rbl.dt    = rbl.dt(X),
          use.rbl.ju    = rbl.ju(Xdf),
          use.rbindlist = rbindlist(X),
          replications = 5)
##            test replications elapsed relative user.self
## 3 use.rbindlist            5    0.10      1.0      0.09
## 1    use.rbl.dt            5    0.10      1.0      0.09
## 2    use.rbl.ju            5    0.33      3.3      0.31
I'm not sure you really need to use as.data.frame, because a data.table inherits from class data.frame.
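A quick check to illustrate (the object name DT is just a placeholder):

DT <- rbindlist(Xdf)
class(DT)          # "data.table" "data.frame"
is.data.frame(DT)  # TRUE
## only if downstream code strictly requires a plain data.frame:
# DF <- as.data.frame(DT)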