As a matter of best practice, I'm trying to determine whether it's better to create a function and apply() it across a matrix, or to simply loop over the matrix inside a function. I tried it both ways and was surprised to find that apply() is slower. The task is to take a vector, evaluate each element as positive or negative, and return a vector with 1 where the element is positive and -1 where it is negative. The mash() function loops, and the squish() function is passed to the apply() function.

    million <- as.matrix(rnorm(100000))

    mash <- function(x){
      for(i in 1:NROW(x))
        if(x[i] > 0) {
          x[i] <- 1
        } else {
          x[i] <- -1
        }
      return(x)
    }

    squish <- function(x){
      if(x > 0) {
        return(1)
      } else {
        return(-1)
      }
    }

    ptm <- proc.time()
    loop_million <- mash(million)
    proc.time() - ptm

    ptm <- proc.time()
    apply_million <- apply(million, 1, squish)
    proc.time() - ptm

loop_million results:

       user  system elapsed
      0.468   0.008   0.483

apply_million results:

       user  system elapsed
      1.401   0.021   1.423

What is the advantage to using apply() over a for loop if performance is degraded? Is there a flaw in my test? I compared the two resulting objects for a clue and found:

    > class(apply_million)
    [1] "numeric"
    > class(loop_million)
    [1] "matrix"

Which only deepens the mystery. The apply() function cannot accept a simple numeric vector, which is why I cast it with as.matrix() at the beginning. But then it returns a numeric. The for loop is fine with a simple numeric vector, and it returns an object of the same class as the one passed to it.
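The class difference described above can be reproduced on a small input. The following is a minimal sketch, assuming a four-element stand-in matrix `m` (not the question's data) and a compact version of squish(); it uses is.matrix() rather than class() because newer R versions report a matrix's class differently.

```r
# Small stand-in for the question's data: a one-column matrix (assumed values).
m <- as.matrix(c(-1.5, 0.2, 3.1, -0.4))

squish <- function(x) if (x > 0) 1 else -1

# apply() over rows collects one scalar per row into a plain vector,
# so the dim attribute of the input is not carried over.
apply_result <- apply(m, 1, squish)
is.matrix(apply_result)   # FALSE

# Assigning into the elements of a copy leaves the dim attribute untouched.
loop_result <- m
for (i in seq_len(nrow(m))) loop_result[i] <- squish(m[i])
is.matrix(loop_result)    # TRUE

# Restoring the shape by hand is one way to get a matrix back from apply().
dim(apply_result) <- dim(m)
```

So the loop returns a matrix only because it modifies a copy of its matrix argument in place, while apply() builds a fresh result from the per-row return values.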
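The apply functions do run a loop in the background. Some of them, such as lapply() and vapply(), run that loop in C (the language much of R is built in), which can make them slightly faster than an equivalent for loop. apply() itself, however, is written in R around an ordinary for loop, so it carries extra overhead and is not guaranteed to be faster.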
In pandas (Python), by comparison, DataFrame.apply() works along a specific axis of the DataFrame: it passes each column to the function with axis=0 or each row with axis=1. apply() is described as better than iterrows() since it uses C extensions for Python in Cython. We are now in microseconds, making our loop faster by ~1900 times the naive loop in time.
Indeed, R for loops are inefficient, especially if you use them incorrectly. A search for why R loops are slow shows that many users wonder about this question. Below, I summarize my experience and online discussions regarding this issue by providing some trivial code examples.
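One commonly cited way to use a loop incorrectly is to grow the result inside it. The sketch below is an illustration of that pattern, not the examples the author refers to; the function names `grow`, `prealloc`, and `vectorised` are hypothetical, and timings will vary by machine.

```r
n <- 1e4

# Growing the result one element at a time: c() builds a new,
# longer vector on every iteration.
grow <- function(n) {
  out <- c()
  for (i in seq_len(n)) out <- c(out, i^2)
  out
}

# Preallocating the full vector first, then filling it by index.
prealloc <- function(n) {
  out <- numeric(n)
  for (i in seq_len(n)) out[i] <- i^2
  out
}

# Vectorised form: no explicit loop at the R level.
vectorised <- function(n) seq_len(n)^2

# All three produce the same values; only the cost differs.
system.time(r_grow <- grow(n))
system.time(r_pre  <- prealloc(n))
system.time(r_vec  <- vectorised(n))
```

The three results are the same, so any difference in the reported times comes from how the loop is written rather than from what it computes.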
The point of the apply (and plyr) family of functions is not speed, but expressiveness. They also tend to prevent bugs because they eliminate the bookkeeping code needed with loops.
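To make the bookkeeping point concrete, here is a small sketch comparing a hand-written loop with vapply() from base R. The input vector `x` is an assumed example, and squish() is a compact version of the function from the question.

```r
x <- c(-2, 0.5, 3, -0.1)   # assumed example input
squish <- function(v) if (v > 0) 1 else -1

# Loop version: allocate the result, manage the index, assign by hand.
loop_out <- numeric(length(x))
for (i in seq_along(x)) loop_out[i] <- squish(x[i])

# vapply version: no index variable or preallocation, and the
# numeric(1) template makes R check each result's type and length.
vapply_out <- vapply(x, squish, numeric(1))

identical(loop_out, vapply_out)
```

Both give the same answer; the vapply() call simply has fewer places where an off-by-one index or a wrongly sized result vector could slip in.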
Lately, answers on Stack Overflow have over-emphasised speed. Your code will get faster on its own as computers get faster and R-core optimises the internals of R. Your code will never get more elegant or easier to understand on its own.
In this case you can have the best of both worlds: an elegant answer using vectorisation that is also very fast, (million > 0) * 2 - 1.
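As a rough check of the vectorised expression, the sketch below compares it with the loop from the question on the same kind of input. The seed and the ifelse() variant are additions for illustration; timings depend on the machine and R version.

```r
set.seed(1)                          # assumed seed, for reproducibility only
million <- as.matrix(rnorm(100000))

# Loop version, as in the question.
mash <- function(x) {
  for (i in 1:NROW(x)) {
    if (x[i] > 0) x[i] <- 1 else x[i] <- -1
  }
  x
}

system.time(loop_million <- mash(million))

# Vectorised version: TRUE/FALSE becomes 1/0, then maps to 1/-1.
system.time(vec_million <- (million > 0) * 2 - 1)

# ifelse() is another vectorised spelling of the same rule.
ifelse_million <- ifelse(million > 0, 1, -1)

# sign(million) would differ for exact zeros: it returns 0 there,
# while the rule in the question maps zero to -1.
all(vec_million == loop_million)
is.matrix(vec_million)
```

Unlike apply(), the vectorised expression keeps the matrix shape of its input, so it also avoids the class mismatch raised in the question.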