Fast vectorized merge of list of data.frames by row

Most of the questions about merging data.frame in lists on SO don't quite relate to what I'm trying to get across here, but feel free to prove me wrong.

I have a list of data.frames and would like to "rbind" them row by row: all first rows form one data.frame, all second rows form a second data.frame, and so on. The result would be a list whose length equals the number of rows in each original data.frame. For now, all the data.frames have identical dimensions.

Here's some data to play around with.

    sample.list <- list(data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)),
            data.frame(x = sample(1:100, 10), y = sample(1:100, 10), capt = sample(0:1, 10, replace = TRUE)))

Here's what I've come up with with the good ol' for loop.

    #solution 1
    my.list <- vector("list", nrow(sample.list[[1]]))
    for (i in 1:nrow(sample.list[[1]])) {
        for (j in 1:length(sample.list)) {
            my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
        }
    }

    #solution 2 (so far my favorite)
    sample.list2 <- do.call("rbind", sample.list)
    my.list2 <- vector("list", nrow(sample.list[[1]]))

    for (i in 1:nrow(sample.list[[1]])) {
        my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
    }

Can this be improved using vectorization without much brainhurt? Correct answer will contain a snippet of code, of course. "Yes" as an answer doesn't count.

EDIT

    #solution 3 (a variant of solution 2 above)
    ind <- rep(1:nrow(sample.list[[1]]), times = length(sample.list))
    my.list3 <- split(x = sample.list2, f = ind)
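A minimal sketch of why solutions 2 and 3 give the same grouping, using made-up toy data (`toy.list`, `stacked`, `by.loop` and `by.split` are names invented for this illustration). After `do.call("rbind", ...)`, row i of every data.frame sits at positions i, i + nr, i + 2 * nr and so on, so an index that repeats `1:nr` once per data.frame sends those rows to the same list element.

```r
# Toy input: 3 data.frames with 2 rows each (hypothetical data)
toy.list <- list(
    data.frame(x = 1:2, y = c(10, 20)),
    data.frame(x = 3:4, y = c(30, 40)),
    data.frame(x = 5:6, y = c(50, 60))
)
nr <- nrow(toy.list[[1]])

# Stack everything once; each input occupies a contiguous block of nr rows
stacked <- do.call("rbind", toy.list)

# Solution 2 style: take every nr-th row, starting at row i
by.loop <- lapply(seq_len(nr), function(i) {
    stacked[seq(from = i, to = nrow(stacked), by = nr), ]
})

# Solution 3 style: let split() do the grouping in one call
ind <- rep(seq_len(nr), times = length(toy.list))
by.split <- split(x = stacked, f = ind)

by.split[[1]]$x  # 1 3 5: the first rows of all three inputs
by.loop[[2]]$x   # 2 4 6: the second rows of all three inputs
```

The only visible difference is that `split()` names the list elements after the levels of `ind` ("1", "2"), while the loop version leaves them unnamed.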

BENCHMARKING

I've made my list larger with more rows per data.frame. I've benchmarked the results which are as follows:

    #solution 1
    system.time(for (i in 1:nrow(sample.list[[1]])) {
        for (j in 1:length(sample.list)) {
            my.list[[i]] <- rbind(my.list[[i]], sample.list[[j]][i, ])
        }
    })
       user  system elapsed
     80.989   0.004  81.210

    # solution 2
    system.time(for (i in 1:nrow(sample.list[[1]])) {
        my.list2[[i]] <- sample.list2[seq(from = i, to = nrow(sample.list2), by = nrow(sample.list[[1]])), ]
    })
       user  system elapsed
      0.957   0.160   1.126

    # solution 3
    system.time(split(x = sample.list2, f = ind))
       user  system elapsed
      1.104   0.204   1.332

    # solution Gabor
    system.time(lapply(1:nr, bind.ith.rows))
       user  system elapsed
      0.484   0.000   0.485

    # solution ncray
    system.time(alply(do.call("cbind",sample.list), 1,
                    .fun=matrix, ncol=ncol(sample.list[[1]]), byrow=TRUE,
                    dimnames=list(1:length(sample.list),names(sample.list[[1]]))))
       user  system elapsed
     11.296   0.016  11.365
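The post does not say how large the benchmark list was, so the timings above cannot be reproduced exactly. The following is only a sketch of how a comparable setup might be built. The sizes `n.frames` and `n.rows` are assumptions, as are the names `big.list`, `big.stacked`, `res.split` and `res.lapply`. `bind.ith.rows` is redefined here to read from `big.list` rather than `sample.list`.

```r
# Sizes are assumptions; the original post does not state them
n.frames <- 20
n.rows <- 200

set.seed(1)  # make the toy data reproducible

# replace = TRUE is needed because n.rows exceeds the 100 available values
big.list <- replicate(n.frames, data.frame(
    x = sample(1:100, n.rows, replace = TRUE),
    y = sample(1:100, n.rows, replace = TRUE),
    capt = sample(0:1, n.rows, replace = TRUE)
), simplify = FALSE)

# Solution 3: stack once, then split by a repeating row index
big.stacked <- do.call("rbind", big.list)
ind <- rep(seq_len(n.rows), times = n.frames)
time.split <- system.time(res.split <- split(x = big.stacked, f = ind))

# lapply approach: extract the i-th row of every data.frame and bind them
bind.ith.rows <- function(i) do.call(rbind, lapply(big.list, "[", i, TRUE))
time.lapply <- system.time(res.lapply <- lapply(seq_len(n.rows), bind.ith.rows))

# Inspect the timings; the values depend on the machine and R version
time.split
time.lapply
```

Relative timings may shift with the number of data.frames versus the number of rows, so it is worth trying sizes close to the real data.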
Roman Luštrik asked Feb 01 '11 13:02


1 Answer

Try this:

    bind.ith.rows <- function(i) do.call(rbind, lapply(sample.list, "[", i, TRUE))
    nr <- nrow(sample.list[[1]])
    lapply(1:nr, bind.ith.rows)
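To unpack the one-liner, here is a small sketch on made-up data (`demo.list`, `ith.rows`, `same.rows` and `bound` are illustrative names, not part of the answer). Passing `"[", i, TRUE` to `lapply` calls the subsetting function on each data.frame with row index `i` and column index `TRUE`, which amounts to `df[i, ]`.

```r
# Toy input: 2 data.frames with 2 rows each (hypothetical data)
demo.list <- list(
    data.frame(x = 1:2, y = c(10, 20)),
    data.frame(x = 3:4, y = c(30, 40))
)
i <- 2

# Each element becomes demo.list[[k]][i, TRUE]: row i, all columns
ith.rows <- lapply(demo.list, "[", i, TRUE)

# The same extraction written with an explicit anonymous function
same.rows <- lapply(demo.list, function(df) df[i, ])

# do.call() passes the list elements as separate arguments to rbind()
bound <- do.call(rbind, ith.rows)

bound$x  # 2 4
bound$y  # 20 40
```

Wrapping this in a function of `i` and mapping it over `1:nr` with `lapply` yields one data.frame per row position, which is the list the question asks for.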
G. Grothendieck answered Oct 20 '22 00:10