
Why not use a for loop?

Tags: for-loop, r

I've been seeing a lot of comments among data scientists online about how for loops are not advisable. However, I recently found myself in a situation where using one was helpful. I would like to know if there is a better alternative for the following process (and why the alternative would be better):

I needed to run a series of repeated-measures ANOVAs and approached the problem similarly to the reproducible example you see below.

[I am aware that there are other issues with running multiple ANOVA models and that there are other options for these sorts of analyses, but for now I'd simply like to hear about the use of the for loop.]

As an example, consider four repeated-measures ANOVA models: four dependent variables, each measured on three occasions:

set.seed(1976)
code <- seq(1:60)
time <- rep(c(0,1,2), each = 20)
DV1 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat <- data.frame(code, time, DV1, DV2, DV3, DV4)

outANOVA <- list()

# loop over every column of dat and store each ANOVA summary under the column's name
for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}

outANOVA
asked Feb 14 '18 by Peter Miksza

2 Answers

You could write it this way; it's more compact:

outANOVA <-
  lapply(dat,function(y)
    summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
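
Because lapply keeps the column names of dat, you can then pull out a single model's summary by name, for example:

outANOVA$DV1   # the repeated-measures ANOVA summary for the first dependent variable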

for loops are not necessarily slower than apply functions, but many people find them less easy to read. It is to some extent a matter of taste.

The real crime is to use a for loop when a vectorized function is available. These vectorized functions usually contain for loops written in C that are much faster (or call functions that do).
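
For example (a minimal illustration of that point, not code from the question): squaring every element of a vector with an explicit loop versus the built-in vectorized operator gives the same result, but the vectorized form does its looping in compiled code:

x <- rnorm(1e6)

# explicit R-level loop: each iteration is interpreted
out <- numeric(length(x))
for (i in seq_along(x)) out[i] <- x[i]^2

# vectorized equivalent: the loop happens in compiled C code
out <- x^2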

Notice that in this case we also avoid creating a global variable y and that we don't have to initialize the list outANOVA.

Another point, taken directly from the relevant post For loops in R and computational speed (answer by Glen_b):

For loops in R are not always slower than other approaches, like apply - but there's one huge bugbear: never grow an array inside a loop.

Instead, make your arrays full-size before you loop and then fill them up.

In your case you're growing outANOVA; for big loops this could become problematic.
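
If you prefer to keep the explicit loop, you can pre-allocate the list to its final length before filling it (a small sketch adapted from the code in the question):

# pre-allocate the result list and name its slots after the columns of dat
outANOVA <- vector(mode = "list", length = ncol(dat))
names(outANOVA) <- names(dat)

for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}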

Here is a microbenchmark of different approaches on a simple example:

n <- 100000
microbenchmark::microbenchmark(
preallocated_vec  = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec   = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list  = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply            = sapply(1:n, function(i) i^2),
lapply            = lapply(1:n, function(i) i^2),
times=20)

# Unit: milliseconds
# expr                     min         lq       mean     median         uq        max neval
# preallocated_vec    9.784237  10.100880  10.686141  10.367717  10.755598  12.839584    20
# preallocated_vec2   9.953877  10.315044  10.979043  10.514266  11.792158  12.789175    20
# incremented_vec    74.511906  79.318298  81.277439  81.640597  83.344403  85.982590    20
# preallocated_list  10.680134  11.197962  12.382082  11.416352  13.528562  18.620355    20
# incremented_list  196.759920 201.418857 212.716685 203.485940 205.441188 393.522857    20
# sapply              6.557739   6.729191   7.244242   7.063643   7.186044   9.098730    20
# lapply              6.019838   6.298750   6.835941   6.571775   6.844650   8.812273    20
answered by Moody_Mudskipper


For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.

Here's an example below, where I ran microbenchmark::microbenchmark on your solution as presented in the OP, Moody's solution as in his post, and a third solution of mine with even more vectorization (a triple nested lapply).

Microbenchmark

set.seed(1976)
code = seq(1:60)
time = rep(c(0,1,2), each = 20)
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)

library(microbenchmark)

microbenchmark(
    `Peter Miksza` = {
        outANOVA1 = list()
        for (i in names(dat)) {
            y = dat[[i]]
            outANOVA1[[i]] = summary(aov(y ~ factor(time) + Error(factor(code)),
                data = dat))
        }
    },
    Moody_Mudskipper = {
        outANOVA2 =
            lapply(dat,function(y)
                summary(aov(y ~ factor(time) + Error(factor(code)),data = dat)))
    },
    `catastrophic_failure` = {
        outANOVA3 = 
            lapply(lapply(lapply(dat, function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
    },
    times = 1000L)

Results

#Unit: milliseconds
#                 expr      min       lq     mean   median       uq       max neval cld
#         Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448  1000   b
#     Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811  1000  a 
# catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666  68.87781  1000  a 

Fiddling with JIT compilation also matters: disabling the byte-code optimizer and just-in-time compilation (the two calls shown below) and then re-running the benchmark gives the following results.
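
The calls used, run before repeating the benchmark (running them in a fresh session is an assumption about the setup):

compiler::setCompilerOptions(optimize = 0)  # turn off byte-code optimization
compiler::enableJIT(0)                      # disable just-in-time compilation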

#Unit: milliseconds
#                 expr      min       lq     mean   median       uq      max neval cld
#         Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731  1000   a
#     Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284  1000   a
# catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036  1000   a

Conclusion

As alluded to in Dirk's comment, there isn't a difference in performance, but readability is greatly impaired by using vectorization.

On growing lists

Experimenting with Moody's solutions, it seems growing a list can be a bad idea if the resulting list is even moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application, though.
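
Here is a rough sketch of what using a byte-compiled function directly can look like (fill_squares is an invented example, not code from this answer; with JIT compilation disabled as above, the compiled copy is typically a bit faster):

library(compiler)

# a plain loop-based function and a byte-compiled copy of it
fill_squares <- function(n) {
  x <- numeric(n)
  for (i in seq_len(n)) x[i] <- i^2
  x
}
fill_squares_cmp <- cmpfun(fill_squares)

microbenchmark::microbenchmark(
  interpreted = fill_squares(1e5),
  compiled    = fill_squares_cmp(1e5),
  times = 20
)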

answered by catastrophic-failure