
Loops inefficiency in R

Good morning,

I have been developing in R for a few months, and because I analyze big datasets I have to make sure that the execution time of my code is not too long.

Hence, I have been trying to use as many vectorized functions as possible.

However, I am still wondering something.

What is costly in R is not the loop itself, right? I mean, the problem arises when you start modifying variables within the loop, for example. Is that correct?

So I was wondering: what if you simply have to run a function on each element and do not actually care about the result, for example to write data to a database? What should you do?

1) use mapply without storing the result anywhere?

2) do a loop over the vector and only apply f(i) to each element?

3) is there a better function I might have missed?

(That is, of course, assuming your function is not already optimally vectorized.)

What about the foreach package? Have you experienced any performance improvement by using it?

SRKX asked Jun 28 '10 02:06


2 Answers

Just a couple of comments. A for loop is roughly as fast as apply and its variants; the real speed-ups come when you vectorise your function as much as possible, that is, when the looping happens in low-level compiled code rather than in apply, which just hides an R-level for loop. I'm not sure this is the best example, but consider the following:

> n <- 1e06
> sinI <- rep(NA,n)
> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
  3.316   0.000   3.358 
> system.time(sinI <- sapply(1:n,sin))
   user  system elapsed 
  5.217   0.016   5.311 
> system.time(sinI <- unlist(lapply(1:n,sin),
+       recursive = FALSE, use.names = FALSE))
   user  system elapsed 
  1.284   0.012   1.303 
> system.time(sinI <- sin(1:n))
   user  system elapsed 
  0.056   0.000   0.057 
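
As a sanity check on the comparison above, the following sketch confirms that all four approaches produce the same vector, so only their speed differs. It uses a smaller n so that it runs quickly, and the result names (loop_res and so on) are illustrative rather than taken from the answer. Timings depend on the machine and R version, so none are shown.

```r
n <- 1e4   # smaller than the 1e6 used above, to keep the check fast

# 1. Explicit loop writing into a preallocated numeric vector
loop_res <- numeric(n)
for (i in seq_len(n)) loop_res[i] <- sin(i)

# 2. sapply: loops at R level, then simplifies the list of results
sapply_res <- sapply(seq_len(n), sin)

# 3. lapply + unlist: same R-level loop, with cheaper simplification
lapply_res <- unlist(lapply(seq_len(n), sin),
                     recursive = FALSE, use.names = FALSE)

# 4. Vectorised call: the loop runs in compiled code
vec_res <- sin(seq_len(n))

# All four agree, so the choice between them is purely about speed
stopifnot(isTRUE(all.equal(loop_res, vec_res)),
          isTRUE(all.equal(sapply_res, vec_res)),
          isTRUE(all.equal(lapply_res, vec_res)))
```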

In one of the comments below, Marek points out that the time-consuming part of the for loop above is actually the indexed assignment sinI[i] <- sin(i) (the [<- operation), not the loop itself.
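
This bears on the side-effect case raised in the question: when a function is called only for its effect, there is nothing to assign inside the loop, so a plain for loop seems a reasonable choice. Below is a minimal sketch of that case. Note that write_record and log_env are hypothetical stand-ins for something like a database write; they are not from the question or the answers.

```r
# Hypothetical side-effect function: stands in for e.g. a database insert.
# It only counts its calls so that the effect can be observed.
log_env <- new.env()
log_env$count <- 0
write_record <- function(x) {
  log_env$count <- log_env$count + 1   # side effect only
  invisible(NULL)                      # nothing useful to return
}

ids <- 1:1000

# Option A: plain loop, no indexed assignment, no result collected
for (id in ids) write_record(id)

# Option B: lapply with the result discarded; it still builds a list of NULLs
invisible(lapply(ids, write_record))

# Each option called write_record once per element
log_env$count
```

Both options call the function once per element. The loop simply avoids allocating a result list that is thrown away.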

The bottlenecks which can't immediately be vectorised can be rewritten in C or Fortran, compiled with R CMD SHLIB, and then plugged in with .Call, .C or .Fortran.

For more information about loop optimisation in R, check out the article "How Can I Avoid This Loop or Make It Faster?" in R News.

nullglob answered Oct 06 '22 01:10


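vapply avoids the post-processing that sapply performs to work out the shape of the result, because it requires you to specify the type and length of each return value up front. In the timings below it turns out to be about 3.4 times faster than the for loop: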

> system.time(for(i in 1:n) sinI[i] <- sin(i))
   user  system elapsed 
   2.41    0.00    2.39 

> system.time(sinI <- unlist(lapply(1:n,sin), recursive = FALSE, use.names = FALSE))
   user  system elapsed 
   1.46    0.00    1.45 

> system.time(sinI <- vapply(1:n,sin, numeric(1)))
   user  system elapsed 
   0.71    0.00    0.69 
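
Besides speed, declaring the return value also gives a safety check. The sketch below illustrates this with a small input. The function ragged is a made-up example, not part of the answer, and is there only to show what happens when results do not have the declared shape.

```r
x <- 1:5

# vapply checks that every call returns exactly one numeric value
res <- vapply(x, sin, numeric(1))
stopifnot(is.numeric(res), length(res) == length(x))

# Illustrative function whose return length varies with its input
ragged <- function(i) if (i %% 2 == 0) c(i, i) else i

# sapply silently falls back to a list when results cannot be simplified
sapply_is_list <- is.list(sapply(x, ragged))

# vapply raises an error instead of returning an unexpected structure
err <- tryCatch(vapply(x, ragged, numeric(1)), error = function(e) e)
vapply_failed <- inherits(err, "error")
```

Here sapply_is_list and vapply_failed are both TRUE: sapply changes its return type depending on the data, whereas vapply fails loudly.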

Tommy answered Oct 06 '22 01:10