
Performance of rbind.data.frame

Tags: performance, dataframe, r, rbind

I have a list of dataframes for which I am certain that they all contain at least one row (in fact, some contain only one row, and others contain a given number of rows), and that they all have the same columns (names and types). In case it matters, I am also certain that there are no NA's anywhere in the rows.

The situation can be simulated like this:

#create one row
onerowdfr<-do.call(data.frame, c(list(), rnorm(100) , lapply(sample(letters[1:2], 100, replace=TRUE), function(x){factor(x, levels=letters[1:2])})))
colnames(onerowdfr)<-c(paste("cnt", 1:100, sep=""), paste("cat", 1:100, sep=""))
#reuse it in a list
someParts<-lapply(rbinom(200, 1, 14/200)*6+1, function(reps){onerowdfr[rep(1, reps),]})

I've set the parameters (of the randomization) so that they approximate my true situation.

Now, I want to unite all these dataframes in one dataframe. I thought using rbind would do the trick, like this:

system.time(
result<-do.call(rbind, someParts)
)

Now, on my system (which is not particularly slow), and with the settings above, this is the output of system.time:

   user  system elapsed 
   5.61    0.00    5.62

Nearly 6 seconds for rbind-ing 254 (in my case) rows of 200 variables? Surely there has to be a way to improve the performance here? In my code, I have to do similar things very often (it is a form of multiple imputation), so I need this to be as fast as possible.


asked May 12 '11 15:05

Nick Sabbe

5 Answers

Can you build your matrices with numeric variables only and convert to a factor at the end? rbind is a lot faster on numeric matrices.

On my system, using data frames:

> system.time(result<-do.call(rbind, someParts))
   user  system elapsed 
  2.628   0.000   2.636

Building the list with all numeric matrices instead:

onerowdfr2 <- matrix(as.numeric(onerowdfr), nrow=1)
someParts2<-lapply(rbinom(200, 1, 14/200)*6+1, 
                   function(reps){onerowdfr2[rep(1, reps),]})

results in a lot faster rbind.

> system.time(result2<-do.call(rbind, someParts2))
   user  system elapsed 
  0.001   0.000   0.001

EDIT: Here's another possibility; it just combines each column in turn.

> system.time({
+   n <- 1:ncol(someParts[[1]])
+   names(n) <- names(someParts[[1]])
+   result <- as.data.frame(lapply(n, function(i) 
+                           unlist(lapply(someParts, `[[`, i))))
+ })
   user  system elapsed 
  0.810   0.000   0.813

Still not nearly as fast as using matrices though.

EDIT 2:

If you only have numerics and factors, it's not that hard to convert everything to numeric, rbind them, and convert the necessary columns back to factors. This assumes all factors have exactly the same levels. Converting to a factor from an integer is also faster than from a numeric so I force to integer first.

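# Flatten each data frame into a numeric matrix; factors become their integer codes.
someParts2 <- lapply(someParts, function(x)
                     matrix(unlist(x), ncol=ncol(x)))
result<-as.data.frame(do.call(rbind, someParts2))
a <- someParts[[1]]
names(result) <- names(a)  # the matrices carry no column names, so restore them
f <- which(sapply(a, is.factor))
for(i in f) {
  # Rebuild each factor column from its integer codes and the original levels.
  lev <- levels(a[[i]])
  result[[i]] <- factor(as.integer(result[[i]]), levels=seq_along(lev), labels=lev)
}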

The timing on my system is:

   user  system elapsed 
   0.090    0.00    0.091
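
The matrix round trip in EDIT 2 is only safe when every factor column has identical levels in every part, because the integer codes are mapped back using the levels of the first data frame. A minimal sketch of a check for that assumption is below; `same_factor_levels` and the tiny example data are made up for illustration and are not part of the original answer.

```r
# Hypothetical helper: TRUE when every part has the same factor levels,
# column by column, as the first part. levels() is NULL for non-factors,
# so numeric columns compare equal automatically.
same_factor_levels <- function(parts) {
  ref <- lapply(parts[[1]], levels)
  all(vapply(parts,
             function(p) identical(lapply(p, levels), ref),
             logical(1)))
}

lv <- c("a", "b")
ok_parts <- list(
  data.frame(cnt = 1, cat = factor("a", levels = lv)),
  data.frame(cnt = 2, cat = factor("b", levels = lv))
)
bad_parts <- list(
  data.frame(cnt = 1, cat = factor("a", levels = lv)),
  data.frame(cnt = 2, cat = factor("b"))  # levels are just "b"
)

same_factor_levels(ok_parts)   # TRUE
same_factor_levels(bad_parts)  # FALSE
```

If the check fails, the integer codes would be relabelled incorrectly, so falling back to plain rbind for that list is probably the safer choice.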

answered Oct 25 '22 05:10

Aaron left Stack Overflow

Not a huge boost, but swapping rbind for rbind.fill from the plyr package knocks about 10% off the running time (with the sample dataset, on my machine).
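
For reference, the swap might look roughly like this. It is only a sketch: it assumes the plyr package is installed, and `parts` is a small made-up stand-in for the list in the question.

```r
# Assumes the plyr package is installed; `parts` is made-up example data.
library(plyr)

parts <- list(
  data.frame(cnt = c(0.5, 1.5), cat = c("a", "b"), stringsAsFactors = TRUE),
  data.frame(cnt = 2.5, cat = "b", stringsAsFactors = TRUE)
)

# rbind.fill() accepts a list of data frames directly, so no do.call() is needed.
combined <- rbind.fill(parts)
```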

answered Oct 25 '22 05:10

Richie Cotton

If you really want to manipulate your data.frames faster, I suggest using the data.table package and its rbindlist() function. I have not run extensive tests, but for my dataset (3000 data frames of 1000 rows x 40 columns each), rbindlist() takes only 20 seconds.
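
A minimal sketch of the call, assuming the data.table package is installed. `parts` is a small made-up list standing in for the real data, and the conversion back with as.data.frame() is only needed if the rest of the code expects a plain data frame.

```r
# Assumes the data.table package is installed; `parts` is made-up example data.
library(data.table)

lv <- c("a", "b")
parts <- list(
  data.frame(cnt = c(0.5, 1.5), cat = factor(c("a", "b"), levels = lv)),
  data.frame(cnt = 2.5, cat = factor("b", levels = lv))
)

# rbindlist() takes the list itself and returns a data.table.
combined_dt <- rbindlist(parts)

# Convert back if downstream code expects a plain data.frame.
combined <- as.data.frame(combined_dt)
```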

answered Oct 25 '22 04:10

Daniele

This is ~25% faster, but there has to be a better way...

system.time({
  N <- do.call(sum, lapply(someParts, nrow))
  SP <- as.data.frame(lapply(someParts[[1]], function(x) rep(x,N)))
  k <- 0
  for(i in 1:length(someParts)) {
    j <- k+1
    k <- k + nrow(someParts[[i]])
    SP[j:k,] <- someParts[[i]]
  }
})

answered Oct 25 '22 04:10

Joshua Ulrich

Make sure you're binding a data frame to a data frame. I ran into a huge performance degradation when binding a list to a data frame.
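
One way to guard against this is to check the elements before binding and convert any plain lists up front. This is just a sketch with made-up data; `parts` and `is_df` are illustrative names.

```r
# Made-up example: the second element is a plain list, not a data frame.
parts <- list(
  data.frame(cnt = 1, cat = "a", stringsAsFactors = FALSE),
  list(cnt = 2, cat = "b"),
  data.frame(cnt = 3, cat = "a", stringsAsFactors = FALSE)
)

# Find the elements that are not data frames and convert only those.
is_df <- vapply(parts, is.data.frame, logical(1))
parts[!is_df] <- lapply(parts[!is_df], as.data.frame, stringsAsFactors = FALSE)

# Every element is now a data frame, so rbind dispatches consistently.
result <- do.call(rbind, parts)
```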

answered Oct 25 '22 03:10

Cameron Turner
