r: for loop operation with nested indices runs super slow

I have an operation I'd like to run for each row of a data frame, changing one column. I'm an apply/ddply/sqldf man, but I'll use loops when they make sense, and I think this is one of those times. This case is tricky because the column to change depends on information that changes by row: depending on information in one cell, I should make a change to only one of ten other cells in that row. With 75 columns and 20000 rows, the operation takes 10 minutes, when every other operation in my script takes 0-5 seconds, ten seconds max. I've stripped my problem down to the very simple test case below.

n <- 20000
t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n) )
system.time(
 for (i in 1:nrow(t.df)) {
 t.df[i,(t.df[i,1]%%10 + 1)] <- 99
 }
)

This takes 70 seconds with ten columns, and 360 seconds when ncol=50. That's crazy. Are loops the wrong approach? Is there a better, more efficient way to do this?

I already tried initializing the nested term (t.df[i,1]%%10 + 1) as a list outside the for loop. It saves about 30 seconds (out of 10 minutes) but makes the example code above more complicated. So it helps, but it's not the solution.
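For reference, here's roughly what that precomputation looks like (a sketch using a plain vector; target.col is just my name for the precomputed indices):

target.col <- t.df[,1] %% 10 + 1  # compute every row's target column once, up front
system.time(
 for (i in 1:nrow(t.df)) {
  t.df[i, target.col[i]] <- 99
 }
)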

My current best idea came while preparing this test case. For me, only 11 of the columns are relevant (the indexing column plus the ten targets); the other 64 are irrelevant. Since the run times depend so much on the number of columns, I can just run the above operation on a data frame that excludes the irrelevant columns, as sketched below. That will get me down to just over a minute. But is "for loop with nested indices" even the best way to think about my problem?
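To illustrate the column-subsetting idea (a sketch; rel is a hypothetical placeholder for whichever columns are actually relevant):

rel <- 1:10                       # hypothetical: the columns the loop actually touches
sub.df <- t.df[, rel]             # work on the narrow data frame
for (i in 1:nrow(sub.df)) {
 sub.df[i, (sub.df[i,1] %% 10 + 1)] <- 99
}
t.df[, rel] <- sub.df             # write the modified columns back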

asked Nov 30 '11 by enfascination



2 Answers

It seems the real bottleneck is having the data in the form of a data.frame. I assume that in your real problem you have a compelling reason to use a data.frame. Is there any way to convert your data so that it can remain in a matrix?

By the way, great question and a very good example.

Here's an illustration of how much faster loops are on matrices than on data.frames:

> n <- 20000
> t.df <- matrix(1:5000, ncol=10, nrow=n)
> system.time(
+   for (i in 1:nrow(t.df)) {
+     t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+   }
+ )
   user  system elapsed 
  0.084   0.001   0.084 
> 
> n <- 20000
> t.df <- data.frame(matrix(1:5000, ncol=10, nrow=n))
> system.time(
+   for (i in 1:nrow(t.df)) {
+     t.df[i,(t.df[i,1]%%10 + 1)] <- 99
+   }
+   )
   user  system elapsed 
 31.543  57.664  89.224 
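For what it's worth, once the data are in a matrix you can usually drop the loop altogether by subscripting with a two-column (row, column) index matrix. This isn't from the answer above, just the standard vectorized idiom:

t.m <- matrix(1:5000, ncol=10, nrow=n)
t.m[cbind(1:n, t.m[,1] %% 10 + 1)] <- 99  # one vectorized assignment, no loop

Each row of the cbind() index matrix names one cell to overwrite, so the whole update happens in a single pass at the C level and finishes in milliseconds.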
answered by JD Long

Using row and col seems less complicated to me:

t.df[col(t.df) == (row(t.df) %% 10) + 1]  <- 99

I think Tommy's solution is still faster, but using row and col might be easier to understand.
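One caveat (my note, not from the answer): row(t.df) %% 10 matches t.df[,1] %% 10 only because of how the example data happen to be constructed. If the target column genuinely depends on a value stored in the row, the same logical mask can be built from the data column instead:

t.df[col(t.df) == (t.df[,1] %% 10) + 1] <- 99  # key off the data, not the row number

This works because the length-nrow(t.df) vector on the right is recycled down each column of col(t.df), so every row is compared against its own target index.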

answered by joran