As I am sitting here waiting for some R scripts to run...I was wondering... is there any way to parallelize rbind in R?
I frequently sit waiting for this call to complete, as I deal with large amounts of data.
do.call("rbind", LIST)
Example 3: rbind fill – Row Bind with Missing Columns
Binding data frames with different columns or column names is a bit more complicated with the rbind function. If you try to use rbind on data frames with different columns, R usually returns the error “Error in match.names(clabs, names(xi))”.
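As a rough illustration of the mismatch problem, here is a minimal base R sketch. The data and the helper name `rbind_fill_base` are made up for this example; the helper pads missing columns with NA before binding, which is similar in spirit to what rbind.fill from plyr does, though it is not that function's implementation.

```r
# Two data frames whose columns only partly overlap (made-up data).
df1 <- data.frame(id = 1:2, x = c("a", "b"), stringsAsFactors = FALSE)
df2 <- data.frame(id = 3:4, y = c(10.5, 20.5))

# Plain rbind() refuses to combine them; capture the error message
# instead of letting the script stop.
plain <- tryCatch(rbind(df1, df2), error = function(e) conditionMessage(e))

# Hypothetical helper: add every missing column as NA, then bind.
rbind_fill_base <- function(a, b) {
  all_cols <- union(names(a), names(b))
  for (col in setdiff(all_cols, names(a))) a[[col]] <- NA
  for (col in setdiff(all_cols, names(b))) b[[col]] <- NA
  rbind(a, b)  # rbind() matches data frame columns by name
}

combined <- rbind_fill_base(df1, df2)
print(plain)
print(combined)
```

The NA padding keeps every row, at the cost of columns that are only partly populated.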
The function rbind() is slow, particularly as the data frame gets bigger. You should never use it in a loop. The right way to do it is to initialize the output object at its final size right from the start and then simply fill it in with each turn of the loop.
As you may know, the rbind() function in R is used to bind the rows of different groups of data. In this section, let's construct some simple data frames and bind them using rbind(). The above code will construct a simple data frame presenting student details and names.
As discussed at the link I cite, the reason for the slowness is that each time you add a row, R needs to find a new contiguous block of memory to fit the data frame in.
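The advice to size the output up front can be sketched like this. The column names and the loop body are made up for illustration; the point is only the contrast between growing a data frame with rbind() on every iteration and filling preallocated vectors that are turned into a data frame once at the end.

```r
n <- 500

# Slow pattern: the whole data frame is rebuilt on every iteration.
grow <- data.frame()
for (i in seq_len(n)) {
  grow <- rbind(grow, data.frame(id = i, value = i^2))
}

# Preallocated pattern: allocate the final size once, then fill in place.
id <- integer(n)
value <- numeric(n)
for (i in seq_len(n)) {
  id[i] <- i
  value[i] <- i^2
}
out <- data.frame(id = id, value = value)

# Both approaches should hold the same values.
print(all.equal(grow$value, out$value))
```

Wrapping each loop in system.time() is one way to compare the two on your own machine; the gap usually widens as n grows.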
I haven't found a way to do this in parallel either thus far. However for my dataset (this one is a list of about 1500 dataframes totaling 4.5M rows) the following snippet seemed to help:
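# Repeatedly bind neighbouring pairs until only one data frame is left.
while (length(lst) > 1) {
  idxlst <- seq(from = 1, to = length(lst), by = 2)
  lst <- lapply(idxlst, function(i) {
    # An odd element at the end has no partner; carry it over unchanged.
    if (i == length(lst)) {
      return(lst[[i]])
    }
    return(rbind(lst[[i]], lst[[i + 1]]))
  })
}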
where lst is the list. It seemed to be about 4 times faster than using do.call(rbind, lst) or even do.call(rbind.fill, lst) (with rbind.fill from the plyr package). Each iteration of this code halves the number of data frames, and when the loop ends the combined data frame is lst[[1]].
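To check that the pairwise approach produces the same rows as do.call, here is a small self-contained sketch. The wrapper name `rbind_pairwise` and the test data are assumptions made for this example, and it assumes a non-empty list.

```r
# Hypothetical wrapper around the pairwise loop (assumes length(lst) >= 1).
rbind_pairwise <- function(lst) {
  while (length(lst) > 1) {
    idx <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idx, function(i) {
      # An odd element at the end has no partner; carry it over unchanged.
      if (i == length(lst)) return(lst[[i]])
      rbind(lst[[i]], lst[[i + 1]])
    })
  }
  lst[[1]]
}

# Made-up input: seven small data frames with identical columns.
set.seed(1)
pieces <- lapply(1:7, function(k) data.frame(group = k, value = rnorm(3)))

a <- rbind_pairwise(pieces)
b <- do.call(rbind, pieces)

# Row order is preserved, so the columns should match exactly.
print(identical(a$value, b$value))
```

Whether this is actually faster than do.call will depend on the number and size of the data frames, so it is worth timing on your own data.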
Because you said that you want to rbind data.frame objects, you should use the data.table package. It has a function called rbindlist that is drastically faster than rbind. I am not 100% sure, but I would bet that any use of rbind triggers a copy where rbindlist does not. Anyway, a data.table is a data.frame, so you do not lose anything by trying.
EDIT:
library(data.table)
system.time(dt <- rbindlist(pieces))
   user  system elapsed
   0.12    0.00    0.13
tables()
     NAME  NROW MB COLS                        KEY
[1,] dt   1,000  8 X1,X2,X3,X4,X5,X6,X7,X8,...
Total: 8MB
Lightning fast...
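The timing above relies on a pieces list that is not shown. A self-contained sketch with made-up data might look like the following; it assumes the data.table package is installed and simply skips the rbindlist step otherwise.

```r
# Made-up input: 100 small data frames with identical columns.
pieces <- lapply(1:100, function(k) data.frame(id = k, value = k / 2))

# Base R reference result.
base_result <- do.call(rbind, pieces)

# data.table is an assumption here; this part is skipped if it is missing.
if (requireNamespace("data.table", quietly = TRUE)) {
  dt <- data.table::rbindlist(pieces)
  print(dim(dt))
  print(identical(dt$value, base_result$value))
}
```

rbindlist also accepts fill = TRUE for lists whose data frames do not all share the same columns.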