I've been seeing a lot of comments from data scientists online about how for loops are inadvisable. However, I recently found myself in a situation where using one was helpful. I would like to know whether there is a better alternative for the following process (and why the alternative would be better):
I needed to run a series of repeated-measures ANOVAs and approached the problem similarly to the reproducible example below.
[I am aware that there are other issues with running multiple ANOVA models and that there are other options for these sorts of analyses, but for now I'd simply like to hear about the use of for loops.]
As an example, here are four repeated-measures ANOVA models: four dependent variables, each measured on three occasions:
set.seed(1976)
code <- 1:60                        # row identifier
time <- rep(c(0, 1, 2), each = 20)  # three measurement occasions
DV1 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 <- c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat <- data.frame(code, time, DV1, DV2, DV3, DV4)
outANOVA <- list()
for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}
outANOVA
You could write it this way; it's more compact:
outANOVA <- lapply(dat, function(y)
  summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))
for loops are not necessarily slower than apply functions, but they're less easy to read for many people. It is to some extent a matter of taste.
The real crime is to use a for loop when a vectorized function is available. These vectorized functions usually contain for loops written in C that are much faster (or call functions that do).
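A trivial illustration of that point, unrelated to the ANOVA problem: squaring the elements of a vector element by element versus in one vectorized call.
x <- 1:10

# explicit loop, pre-allocated output
out <- numeric(length(x))
for (i in seq_along(x)) out[i] <- x[i]^2

# vectorized: `^` does the looping in C
out2 <- x^2
identical(out, out2)
# TRUE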
Notice that with the lapply version we also avoid creating a global variable y, and that we didn't have to initialize the list outANOVA.
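To see the first point concretely (a trivial check, assuming the for loop from the question has just been run):
exists("y")  # TRUE -- the loop left y behind in the global environment
exists("i")  # TRUE
rm(y, i)     # has to be cleaned up by hand
# after running only the lapply version in a fresh session, exists("y") is FALSE:
# y lives solely inside the anonymous function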
Another point, directly from this relevant post: For loops in R and computational speed (answer by Glen_b):
For loops in R are not always slower than other approaches, like apply, but there's one huge bugbear: never grow an array inside a loop.
Instead, make your arrays full-size before you loop and then fill them up.
In your case you're growing outANOVA; for big loops this could become problematic.
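For completeness, here is the loop from the question with outANOVA pre-allocated (a minimal sketch; with only six columns the difference is negligible, but this is the pattern that scales):
outANOVA <- vector(mode = "list", length = ncol(dat))  # full-size up front
names(outANOVA) <- names(dat)
for (i in names(dat)) {
  y <- dat[[i]]
  outANOVA[[i]] <- summary(aov(y ~ factor(time) + Error(factor(code)),
                               data = dat))
}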
Here is a microbenchmark of different methods on a simple example:
n <- 100000
microbenchmark::microbenchmark(
preallocated_vec = {x <- vector(length=n); for(i in 1:n) {x[i] <- i^2}},
preallocated_vec2 = {x <- numeric(n); for(i in 1:n) {x[i] <- i^2}},
incremented_vec = {x <- vector(); for(i in 1:n) {x[i] <- i^2}},
preallocated_list = {x <- vector(mode = "list", length = n); for(i in 1:n) {x[i] <- i^2}},
incremented_list = {x <- list(); for(i in 1:n) {x[i] <- i^2}},
sapply = sapply(1:n, function(i) i^2),
lapply = lapply(1:n, function(i) i^2),
times=20)
# Unit: milliseconds
# expr min lq mean median uq max neval
# preallocated_vec 9.784237 10.100880 10.686141 10.367717 10.755598 12.839584 20
# preallocated_vec2 9.953877 10.315044 10.979043 10.514266 11.792158 12.789175 20
# incremented_vec 74.511906 79.318298 81.277439 81.640597 83.344403 85.982590 20
# preallocated_list 10.680134 11.197962 12.382082 11.416352 13.528562 18.620355 20
# incremented_list 196.759920 201.418857 212.716685 203.485940 205.441188 393.522857 20
# sapply 6.557739 6.729191 7.244242 7.063643 7.186044 9.098730 20
# lapply 6.019838 6.298750 6.835941 6.571775 6.844650 8.812273 20
For your use case, I would say the point is moot. Applying vectorization (and, in the process, obfuscating the code) has no benefits here.
Here's an example below, where I did a microbenchmark::microbenchmark of your solution as presented in the OP, Moody's solution as in his post, and a third solution of mine with even more vectorization (a triple nested lapply).
set.seed(1976)
code = 1:60
time = rep(c(0, 1, 2), each = 20)
DV1 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 14, 2))
DV2 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
DV3 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 8, 2))
DV4 = c(rnorm(20, 10, 2), rnorm(20, 10, 2), rnorm(20, 10, 2))
dat = data.frame(code, time, DV1, DV2, DV3, DV4)
library(microbenchmark)
microbenchmark(
  `Peter Miksza` = {
    outANOVA1 = list()
    for (i in names(dat)) {
      y = dat[[i]]
      outANOVA1[[i]] = summary(aov(y ~ factor(time) + Error(factor(code)),
                                   data = dat))
    }
  },
  Moody_Mudskipper = {
    outANOVA2 = lapply(dat, function(y)
      summary(aov(y ~ factor(time) + Error(factor(code)), data = dat)))
  },
  `catastrophic_failure` = {
    outANOVA3 = lapply(lapply(lapply(dat,
      function(y) y ~ factor(time) + Error(factor(code))), aov, data = dat), summary)
  },
  times = 1000L)
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 26.25641 27.63011 31.58110 29.60774 32.81374 136.84448 1000 b
# Moody_Mudskipper 22.93190 23.86683 27.20893 25.61352 28.61729 135.58811 1000 a
# catastrophic_failure 22.56987 23.57035 26.59955 25.15516 28.25666 68.87781 1000 a
Fiddling with JIT compilation, i.e. running compiler::setCompilerOptions(optimize = 0) and compiler::enableJIT(0), the following result ensues as well:
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# Peter Miksza 23.10125 24.27295 28.46968 26.52559 30.45729 143.0731 1000 a
# Moody_Mudskipper 22.82366 24.35622 28.33038 26.72574 30.27768 146.4284 1000 a
# catastrophic_failure 22.59413 24.04295 27.99147 26.23098 29.88066 120.6036 1000 a
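If you experiment with this yourself, you will probably want to restore the compiler defaults afterwards; a minimal sketch (3 and 2 are the documented defaults in recent R versions, so check your own installation):
compiler::enableJIT(3)                      # JIT back on (default level since R 3.4)
compiler::setCompilerOptions(optimize = 2)  # default optimization level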
As alluded to in Dirk's comment, there isn't a difference in performance, but readability is greatly impaired by using vectorization.
Experimenting with Moody's solutions, it seems growing lists can be a bad idea if the resulting list is moderately long. Also, using byte-compiled functions directly can provide a small improvement in performance. Both are expected behaviors. Pre-allocation might prove sufficient for your application though.
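On the second point, here is a sketch of what using byte-compiled functions directly could look like with compiler::cmpfun(); run_anovas is a hypothetical wrapper around Moody's lapply solution, and since JIT compilation is enabled by default in recent R both versions end up compiled anyway, so expect only a small difference:
library(compiler)

run_anovas <- function(d) {            # hypothetical wrapper, not from the answers above
  lapply(d, function(y)
    summary(aov(y ~ factor(time) + Error(factor(code)), data = d)))
}
run_anovas_cmp <- cmpfun(run_anovas)   # explicitly byte-compile the function

microbenchmark::microbenchmark(
  plain    = run_anovas(dat),
  compiled = run_anovas_cmp(dat),
  times = 100L)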