Efficiencies for nested for loop

Tags:

r

I've created the following code that nests a for loop inside of a for loop in R. It is a simulation to calculate Power. I've read that R isn't great for doing for loops but I was wondering if there are any efficiencies I could apply to make this run a bit faster. I'm fairly new to R as well as programming of any sort. Right now the run times I'm seeing are:

m=10 I get .17 sec

m=100 I get 3.95 sec

m=1000 I get 246.26 sec

m=2000 I get 1003.55 sec

I was hoping to set the number of times to sample, m, to upwards of 100K, but I'm afraid to even set it at 10K.

Here is the code:

m = 1000                        # number of times we are going to take samples
popmean = 120                   # set population mean at 120
popvar = 225                    # set known/established population variance at 225
newvar = 144                    # variance of new methodology
alpha = .01                     # set alpha
teststatvect = matrix(nrow=m, ncol=1)  # empty vector to populate with test statistics
power = matrix(nrow=200, ncol=1)       # empty vector to populate with power

system.time(                    # not needed - using to gauge how long this takes
  for (n in 1:length(power)) {      # begin for loop for different sample sizes
    for (i in 1:m) {                # begin for loop to take "m" samples
      y = rnorm(n, popmean, sqrt(newvar))  # sample of size n with mean 120 and var=144
      ts = sum((y-popmean)^2/popvar)       # calculate test statistic for each sample
      teststatvect[i] = ts                 # populate the vector of test statistics
      vecpvals = pchisq(teststatvect, n)   # calculate the p-value of each statistic
      power[n] = length(which(vecpvals <= alpha))/length(vecpvals)  # power is the proportion less than or equal to alpha
    }
  }
)
asked Apr 23 '26 by Will Phillips

1 Answer

I reorganized your code a bit and got rid of the inner loop.

  • Sampling one long vector of random numbers, and then folding it into a matrix, is much faster than repeatedly sampling short vectors. (replicate, as suggested in another answer, is nice for readability, but in this case you can do better by sampling the random numbers in a single block.)
  • colSums is faster than summing inside a for loop or using apply.
  • it's just sugar (i.e. it isn't actually any more efficient), but you can use mean(pvals<=alpha) in place of sum(pvals<=alpha)/length(pvals)
  • I defined a function to return the power for a specified set of parameters (including sample size), then used sapply to range over the vector of sizes (not faster than a for loop, but cleaner and maybe easier to generalize).
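The first two points can be sketched in a few lines (the parameter values mirror the question's setup; ssize is an illustrative sample size):

```r
## Draw all ssize*m values in one call, fold them into a matrix with one
## column per simulated sample, then reduce the columns with colSums.
ssize <- 50; m <- 1000
popmean <- 120; popvar <- 225; newvar <- 144; alpha <- 0.01
set.seed(1)
ymat <- matrix(rnorm(ssize * m, popmean, sd = sqrt(newvar)), ncol = m)
ts <- colSums((ymat - popmean)^2 / popvar)  # one test statistic per sample
pvals <- pchisq(ts, df = ssize)
power <- mean(pvals <= alpha)               # proportion of rejections
```

Each column of ymat plays the role of one iteration of the inner loop, so the m-fold loop collapses into a single vectorized reduction.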

Code:

powfun <- function(ssize=100,
                   m=1000,      ## samples per trial
                   popmean=120, ## pop mean
                   popvar=225,  ## known/established pop variance
                   newvar=144,  ## variance of new methodology
                   alpha=0.01,
                   sampchisq=FALSE)  ## sample directly from chi-squared distrib?
{
    if (!sampchisq) {
      ymat <- matrix(rnorm(ssize*m,popmean,sd=sqrt(newvar)),ncol=m)
      ts <- colSums((ymat-popmean)^2/popvar)          ## test statistic
    } else {
      ts <- rchisq(m,df=ssize)*newvar/popvar
    }
    pvals <- pchisq(ts,df=ssize)                    ## pval
    mean(pvals<=alpha)                              ## power
}

Do you really need the power for every integer value of sample size, or would a more widely spaced grid be OK? (If you need exact values, interpolation would probably be pretty accurate.)

ssizevec <- seq(10,250,by=5)
set.seed(101)
system.time(powvec <- sapply(ssizevec,powfun,m=5000))  ## 13 secs elapsed

This is reasonably fast and might get you up to m=1e5 if you needed, but I'm not quite sure why you need results that are that precise -- the power curve is reasonably smooth with m=5000 ...

If you're impatiently waiting for long simulations, you can also get a progress bar to print by replacing sapply(ssizevec,powfun,m=5000) with library(plyr); aaply(ssizevec,.margins=1,powfun,.progress="text",m=5000)

Finally, I think you can speed the whole thing up a lot by sampling chi-squared values directly, or by doing an analytical power calculation (!). I think that rchisq(m,df=ssize)*newvar/popvar is equivalent to the first two lines of the loop, and you might even be able to do a numerical computation on the chi-squared densities directly ...

system.time(powvec2 <- sapply(ssizevec,powfun,m=5000,sampchisq=TRUE))
## 0.24 seconds elapsed

(I just tried this out, sampling m=1e5 at every value of sample size from 1 to 200 ... it takes 24 seconds ... but I still think it might be unnecessary.)
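If the distributional argument above holds (the test statistic is newvar/popvar times a chi-squared variate with ssize degrees of freedom), the power can be computed exactly with no simulation at all. This is a sketch under that assumption (powfun_exact is an illustrative name, not part of the original answer):

```r
## We reject when pchisq(ts, ssize) <= alpha, i.e. ts <= qchisq(alpha, ssize),
## and ts/(newvar/popvar) ~ chisq(df = ssize), so
## power = P(chisq_ssize <= qchisq(alpha, ssize) * popvar / newvar).
powfun_exact <- function(ssize, popvar = 225, newvar = 144, alpha = 0.01) {
  pchisq(qchisq(alpha, df = ssize) * popvar / newvar, df = ssize)
}
powvec3 <- sapply(seq(10, 250, by = 5), powfun_exact)
```

This runs essentially instantly and removes the Monte Carlo noise entirely, so it could serve as a check on the simulated curves.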

A picture:

par(bty="l",las=1)
plot(ssizevec,powvec,type="l",xlab="sample size",ylab="power",
     xlim=c(0,250),ylim=c(0,1))
lines(ssizevec,powvec2,col="red")

(figure: power vs. sample size; black line = normal sampling, red line = direct chi-squared sampling)

answered Apr 25 '26 by Ben Bolker