Can rbind be parallelized in R?

Tags: r

As I sit here waiting for some R scripts to run, I was wondering: is there any way to parallelize rbind in R?

I frequently find myself waiting for this call to complete, as I deal with large amounts of data:

do.call("rbind", LIST) 
Atlas1j asked Aug 29 '11 00:08




2 Answers

I haven't found a way to do this in parallel either. However, for my dataset (a list of about 1,500 data frames totaling 4.5M rows), the following snippet seemed to help:

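while (length(lst) > 1) {
    # Indices of the first element of each neighbouring pair
    idxlst <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(idxlst, function(i) {
        # An unpaired last element is carried forward unchanged
        if (i == length(lst)) { return(lst[[i]]) }
        return(rbind(lst[[i]], lst[[i + 1]]))
    })
}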

Here lst is the list. It seemed to be about 4 times faster than do.call(rbind, lst) or even do.call(rbind.fill, lst) (with rbind.fill from the plyr package). Each iteration binds neighbouring pairs and so halves the number of data frames; when the loop ends, lst[[1]] holds the combined result.
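The halving loop can be wrapped in a function and checked against do.call(rbind, ...) on toy data. This is only a minimal sketch, assuming all data frames share the same columns; the name rbind_pairwise and the toy pieces list are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper (the name is an assumption): merge neighbouring
# data frames round by round until a single data frame remains.
rbind_pairwise <- function(lst) {
  if (length(lst) == 0) return(NULL)
  while (length(lst) > 1) {
    starts <- seq(from = 1, to = length(lst), by = 2)
    lst <- lapply(starts, function(i) {
      # An odd element at the end has no partner; carry it forward as-is.
      if (i == length(lst)) return(lst[[i]])
      rbind(lst[[i]], lst[[i + 1]])
    })
  }
  lst[[1]]
}

# Toy input: 7 small data frames with identical columns (3 rows each).
pieces <- lapply(1:7, function(k) data.frame(id = k, value = seq_len(3) * k))

combined  <- rbind_pairwise(pieces)
reference <- do.call(rbind, pieces)

nrow(combined)  # 21

# Pairs are merged in order, so rows come out in the same order as with
# do.call(rbind, ...).
stopifnot(identical(combined$id, reference$id),
          identical(combined$value, reference$value))
```

Within one round the pair merges do not depend on each other, so the lapply call is the natural place to try a parallel map such as parallel::mclapply. Whether that actually beats the serial version would need measuring, since every merged result still has to be sent back to the parent process.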

Dominik answered Sep 22 '22 05:09


Because you said that you want to rbind data.frame objects, you should use the data.table package. It has a function called rbindlist that drastically speeds up row binding. I am not 100% sure, but I would bet that any use of rbind triggers a copy where rbindlist does not. In any case, a data.table is a data.frame, so you lose nothing by trying it.

EDIT:

library(data.table)
system.time(dt <- rbindlist(pieces))
   user  system elapsed
   0.12    0.00    0.13
tables()
     NAME  NROW MB COLS                        KEY
[1,] dt   1,000 8  X1,X2,X3,X4,X5,X6,X7,X8,...
Total: 8MB

Lightning fast...
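Since pieces is not defined in the snippet above, here is a self-contained comparison that can be run as-is. It is a rough sketch rather than a benchmark: the toy data (100 data frames of 1,000 rows each) and the variable names are assumptions, and the timings will vary by machine, so none are shown.

```r
library(data.table)

# Toy input (sizes are arbitrary): 100 data frames of 1,000 rows each.
set.seed(1)
pieces <- lapply(1:100, function(k) {
  data.frame(id = k, x = rnorm(1000), y = runif(1000))
})

# Base R: do.call passes every list element to rbind in one call.
t_base <- system.time(df <- do.call(rbind, pieces))

# data.table: rbindlist takes the list itself.
t_dt <- system.time(dt <- rbindlist(pieces))

# Elapsed seconds for each approach.
print(c(base = t_base[["elapsed"]], data.table = t_dt[["elapsed"]]))

# The result is a data.table, which also inherits from data.frame.
stopifnot(is.data.frame(dt), nrow(dt) == nrow(df))
```

If the pieces do not all have the same columns, rbindlist(pieces, fill = TRUE) pads the missing ones with NA, much like rbind.fill from plyr.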

statquant answered Sep 20 '22 05:09