Good morning,
I have been developing in R for a few months, and I have to make sure that the execution time of my code is not too long because I analyze big datasets.
Hence, I have been trying to use as many vectorized functions as possible.
However, I am still wondering something.
What is costly in R is not the loop itself, right? I mean, the problem arises when you start modifying variables within the loop, for example. Is that correct?
So I was thinking: what if you simply have to run a function on each element and you do not actually care about the result, for example to write data to a database? What should you do?
1) use mapply without storing the result anywhere?
2) do a loop over the vector and only apply f(i) to each element?
3) is there a better function I might have missed?
(that's of course assuming your function is not optimally vectorized).
What about the foreach package? Have you experienced any performance improvement by using it?
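For concreteness, options 1) and 2) from the question can be sketched as follows. This is only an illustration: write_row is a hypothetical stand-in for a database write (it just counts calls in an environment so the example runs without a database), and lapply is used in place of mapply because there is a single input vector.

```r
# Hypothetical stand-in for a database write: it only counts how often
# it was called, so the side effect can be checked afterwards.
log_env <- new.env()
assign("count", 0L, envir = log_env)
write_row <- function(x) {
  assign("count", get("count", envir = log_env) + 1L, envir = log_env)
  invisible(NULL)
}

ids <- 1:5

# Option 2: a plain for loop; nothing is collected.
for (i in ids) write_row(i)

# Option 1: an apply-style call whose result is discarded.
# invisible() keeps the list of NULLs from printing at the console.
invisible(lapply(ids, write_row))

get("count", envir = log_env)  # 10 calls in total (5 from each option)
```

Both forms call the function once per element; the apply-style call additionally builds a list of return values that is then thrown away.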
Just a couple of comments. A for loop is roughly as fast as apply and its variants, and the real speed-ups come when you vectorise your function as much as possible (that is, using low-level loops, rather than apply, which just hides the for loop). I'm not sure if this is the best example, but consider the following:
> n <- 1e06
> sinI <- rep(NA,n)
> system.time(for(i in 1:n) sinI[i] <- sin(i))
user system elapsed
3.316 0.000 3.358
> system.time(sinI <- sapply(1:n,sin))
user system elapsed
5.217 0.016 5.311
> system.time(sinI <- unlist(lapply(1:n,sin),
+ recursive = FALSE, use.names = FALSE))
user system elapsed
1.284 0.012 1.303
> system.time(sinI <- sin(1:n))
user system elapsed
0.056 0.000 0.057
In one of the comments below, Marek points out that the time-consuming part of the for loop above is actually the [<- part:
> system.time(sinI <- unlist(lapply(1:n,sin),
+ recursive = FALSE, use.names = FALSE))
user system elapsed
1.284 0.012 1.303
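One way to check Marek's point on your own machine is to time the same loop with and without the element-wise assignment. This is a sketch only: n is reduced from the 1e06 used above so that it runs quickly, and no timings are shown because they depend on the machine and the R version.

```r
n <- 1e5  # smaller than the 1e06 used above so the sketch runs quickly

# Loop that calls sin() but discards the result: no `[<-` involved.
t_call_only <- system.time(for (i in 1:n) sin(i))

# Same loop, but each result is stored with `[<-` in a pre-allocated vector.
sinI <- numeric(n)
t_with_assign <- system.time(for (i in 1:n) sinI[i] <- sin(i))

# Elapsed seconds side by side; the gap is roughly the cost of `[<-`.
print(c(call_only   = t_call_only[["elapsed"]],
        with_assign = t_with_assign[["elapsed"]]))

# The loop computes the same values as the vectorised call.
stopifnot(isTRUE(all.equal(sinI, sin(1:n))))
```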
The bottlenecks which can't immediately be vectorised can be rewritten in C or Fortran, compiled with R CMD SHLIB, and then plugged in with .Call, .C or .Fortran.
See also these links for more info about loop optimisation in R, and check out the article "How Can I Avoid This Loop or Make It Faster?" in R News.
vapply avoids the post-processing by requiring that you specify what the return value is. It turns out to be 3.4 times faster than the for-loop:
> system.time(for(i in 1:n) sinI[i] <- sin(i))
user system elapsed
2.41 0.00 2.39
> system.time(sinI <- unlist(lapply(1:n,sin), recursive = FALSE, use.names = FALSE))
user system elapsed
1.46 0.00 1.45
> system.time(sinI <- vapply(1:n,sin, numeric(1)))
user system elapsed
0.71 0.00 0.69
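As a small illustration of what specifying the return value means in practice: the third argument (FUN.VALUE) is a template, and vapply stops with an error if any result does not match it. The input vector below is made up for the example.

```r
x <- c(0, pi / 2, pi)

# Each call must return a single double, as declared by numeric(1).
res <- vapply(x, sin, numeric(1))
is.numeric(res) && length(res) == length(x)  # TRUE

# If the function returns something else (here two values per element),
# vapply signals an error instead of quietly changing the result's shape.
bad <- tryCatch(
  vapply(x, function(v) c(v, v), numeric(1)),
  error = function(e) "error"
)
bad  # "error"
```

Because the result type and length are known in advance, vapply can fill a pre-allocated vector directly, which is where the saving over unlist(lapply(...)) comes from.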