Does the order of loops matter for speed in R?

Tags:

r

I want to run a simulation study where the simulation depends on two variables, x and y. x and y are vectors of candidate values, and I want to evaluate every combination of them. Furthermore, for each combination of x and y I want multiple replicates, since there is a stochastic term in the simulation and each run will vary.

To give an idea of what I am dealing with, here is a simplified example:

x = 1:10
y = 11:20

iterations = 2000

# one length-3 result per (x, y, iteration) combination
solution = array(NA, c(length(x), length(y), 3, iterations))
for (i in seq_along(x)) {
    for (j in seq_along(y)) {
        for (k in 1:iterations) {
            z = rnorm(3) + c(x[i], y[j], 1)
            solution[i, j, , k] = z
        }
    }
}

In my actual problem, however, the code evaluated inside the loop is much less trivial. The structure of my inputs and of the output is the same, though.

So what I would like to know, using the above example, is: is it most efficient to set up the loops in this order, or would it be better to make k in 1:iterations the outermost loop and use something like outer() inside that single loop, since I am evaluating the function (z in this example) over the grid of x and y?

Also, I am very open to a completely different set-up and design. At the end of the day I want to be able to obtain a solution that is indexed by x and y and averaged over all the iterations, i.e., apply(solution, c(1, 2, 3), mean).
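To make that alternative concrete, here is a rough, untested sketch of what I mean, reusing x, y, and iterations from above. The component-wise split is just my guess at how outer() could be used here, since outer() needs a function that returns one value per (x, y) pair:

# Untested sketch: k outermost, with the (x, y) grid evaluated inside
# each iteration. outer() applies its function elementwise over the
# expanded grid, so the three components of z are built separately.
solution = array(NA, c(length(x), length(y), 3, iterations))
for (k in 1:iterations) {
    noise = array(rnorm(length(x) * length(y) * 3),
                  c(length(x), length(y), 3))
    solution[, , 1, k] = noise[, , 1] + outer(x, y, function(a, b) a)
    solution[, , 2, k] = noise[, , 2] + outer(x, y, function(a, b) b)
    solution[, , 3, k] = noise[, , 3] + 1
}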


Edit:

As suggested, here is the actual code I am using.

library(survival)

iter = 2000   # replicates per (vecX, vecY) combination
n = 120       # total sample size
base = 2      # T:C allocation ratio
hr = 0.5      # hazard ratio
m.x = 3       # median survival, "C" arm
m.y = m.x/hr  # median survival, "T" arm

ANS = NULL
for (vecX in c(0.3, 0.5, 0.6, 0.7)) {
    out = NULL

    for (vecY in c(0, 0.05, 0.1, 0.15, 0.2, 0.25, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95)) {
        m.x.p = m.x/vecX
        m.y.p = m.y/vecX
        m.x.n = m.x
        m.y.n = m.y

        n.t = round(n*base/(base + 1))
        n.c = n - n.t

        for (ii in 1:iter) {
            # split each arm into "P" and "N" subgroups
            n.t.p = rbinom(1, n.t, vecY)
            n.t.n = n.t - n.t.p
            n.c.p = rbinom(1, n.c, vecY)
            n.c.n = n.c - n.c.p

            # exponential survival times; rate log(2)/m gives median m
            S = c(rexp(n.t.p, log(2)/m.y.p), rexp(n.t.n, log(2)/m.y.n),
                  rexp(n.c.p, log(2)/m.x.p), rexp(n.c.n, log(2)/m.x.n))

            data1 = data.frame(Group = c(rep("T", n.t), rep("C", n.c)),
                               dx = c(rep("P", n.t.p), rep("N", n.t.n),
                                      rep("P", n.c.p), rep("N", n.c.n)),
                               S)

            fit = survfit(Surv(data1$S) ~ data1$Group)
            coxfit = coxph(Surv(data1$S) ~ data1$Group)

            HR = exp(coxfit$coefficients)
            p.val = summary(coxfit)$logtest["pvalue"]

            out = rbind(out, c(vecX, vecY, n.t.p, n.t.n, n.c.p, n.c.n, HR, p.val))
        }
    }
    colnames(out) = c("vecX", "vecY", "n.t.p", "n.t.n", "n.c.p", "n.c.n", "HR", "p.val")

    ans = as.data.frame(out)
    ANS = rbind(ANS, ans)
}
asked Mar 24 '16 by RustyStatistician


2 Answers

Yes, I believe that in theory it should make a difference (see example below).

R stores matrices in column-major order, like Fortran (and unlike C), so to minimize cache misses you want to traverse down columns. For filling a matrix, the optimal approach is therefore the one where the outer loop runs over the column index and the inner loop over the row index.

You'd want to bear this in mind for n-dimensional arrays, too. In the case n = 3, I believe this means making the layer (the third index) the outermost loop, then the column, then the row, since the first index varies fastest in memory. I could be mistaken here, though (see the sketch after the matrix example below).

I ran this quick example with 5000-by-5000 matrices, and we see a difference of about 50 seconds, with fill_matrix2() being the faster of the two:

    n <- 5000
    A <- matrix(NA, n, n)
    B <- matrix(NA, n, n)

    # row index in the outer loop: strides across columns, poor locality
    fill_matrix1 <- function(X, val) {
        for (i in 1:nrow(X)) {
            for (j in 1:ncol(X)) {
                X[i, j] <- val
            }
        }
        return(X)
    }

    # column index in the outer loop: walks down each column, good locality
    fill_matrix2 <- function(X, val) {
        for (j in 1:ncol(X)) {
            for (i in 1:nrow(X)) {
                X[i, j] <- val
            }
        }
        return(X)
    }

    system.time(fill_matrix1(A, 0))
    system.time(fill_matrix2(B, 0))
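
And here is a minimal sketch of how one could check the three-dimensional case I guessed at above. The function names are mine, and the relative timing is an expectation from the same cache argument rather than something I have measured:

    # Sketch for n = 3: the first index varies fastest in memory, so the
    # third index ("layer") goes in the outermost loop for sequential writes.
    n3 <- 200
    A3 <- array(NA, c(n3, n3, n3))
    B3 <- array(NA, c(n3, n3, n3))

    fill_array_fast <- function(X, val) {
        for (k in 1:dim(X)[3]) {          # layer: slowest-varying index
            for (j in 1:dim(X)[2]) {      # column
                for (i in 1:dim(X)[1]) {  # row: fastest-varying index
                    X[i, j, k] <- val
                }
            }
        }
        return(X)
    }

    fill_array_slow <- function(X, val) {
        for (i in 1:dim(X)[1]) {          # row outermost: poor locality
            for (j in 1:dim(X)[2]) {
                for (k in 1:dim(X)[3]) {
                    X[i, j, k] <- val
                }
            }
        }
        return(X)
    }

    system.time(fill_array_fast(A3, 0))
    system.time(fill_array_slow(B3, 0))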
answered Nov 13 '22 by paulstey

The order of the loops is practically irrelevant here. If you profile your code (see help("Rprof")) you'll see that the CPU time is spent in functions like survfit and coxph, and of course in growing out, which you should avoid: pre-allocate out to its final size and fill it row by row instead of growing it with rbind().
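To see just the allocation effect in isolation, here is a self-contained sketch (not your simulation; grow() and prealloc() are hypothetical names):

    # Sketch: rbind() copies the whole matrix on every call, so growing a
    # matrix row by row is quadratic; pre-allocating and filling is linear.
    n <- 10000

    grow <- function(n) {
        out <- NULL
        for (i in 1:n) out <- rbind(out, c(i, sqrt(i)))
        out
    }

    prealloc <- function(n) {
        out <- matrix(NA_real_, nrow = n, ncol = 2)
        for (i in 1:n) out[i, ] <- c(i, sqrt(i))
        out
    }

    system.time(grow(n))      # slow: repeated copying
    system.time(prealloc(n))  # fast: single allocation, filled in place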

answered Nov 13 '22 by Roland