Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Speed up iterative loop calculation with R

I have to speed up my script. I have some cycles like:

DT <- data.frame(Index=1:20, A=c(10:29))

cost1 <- 3
cost2 <- 0.05
cost3 <- 50

DT$S[1] <- cost1
for (j in 2:(20)) {
  DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
}

Where cost1 and cost2 are constants. Is it possible to avoid writing a cycle?

like image 841
stefanodv Avatar asked Jun 27 '18 07:06

stefanodv


People also ask

WHY DO FOR loops take so long in R?

Loops are slower in R than in C++ because R is an interpreted language (not compiled), even if now there is just-in-time (JIT) compilation in R (>= 3.4) that makes R loops faster (yet, still not as fast). Then, R loops are not that bad if you don't use too many iterations (let's say not more than 100,000 iterations).

Are for loops slow in R?

Indeed, R for loops are inefficient, especially if you use them wrong. Searching for why R loops are slow discovers that many users are wondering about this question. Below, I summarize my experience and online discussions regarding this issue by providing some trivial code examples.


2 Answers

The main problem with your approach is that you are repeatedly calling elements of data.frame (DT$S), but that is not needed in this calculations. If we replace that with vector and add the results to data.frame at the end, it is much faster. Also we can simplify the formula.

n <- 1e4
DT <- data.frame(Index = 1:n, A = seq(10, by = 1, length.out = n))

cost1 <- 3
cost2 <- 0.05
cost3 <- 50

your <- function() {
  DT$S[1] <- cost1
  for (j in 2:(n)) {
    DT$S[j] <- DT$S[j - 1] - cost3 + DT$S[j - 1]*cost2/12
  }
}
your()

My function:

my <- function() {    
  cc <- (1 + cost2/12)      
  r <- vector('numeric', length = n)
  r[1] <- cost1
  for (j in 2:(n)) {
    # r[j] <- r[j - 1] - cost3 + r[j - 1] * cost2/12
    r[j] <-  r[j - 1] * cc - cost3
  }
  r
}

DT$S2 <- my()
all.equal(DT$S, DT$S2)
# [1] TRUE

microbenchmark::microbenchmark(your(), my(), times = 2)
# Unit: milliseconds
#   expr        min         lq      mean    median         uq        max neval cld
# your() 487.229621 487.229621 490.86917 490.86917 494.508715 494.508715     2   b
#   my()   1.515178   1.515178   1.59408   1.59408   1.672982   1.672982     2  a 
like image 86
minem Avatar answered Sep 25 '22 05:09

minem


Your column S is defined by a first-order linear recurrence. The i-th term can be expressed in function of i, see e.g. these slides.

> DT <- data.frame(Index=1:20)
> cost1 <- 3; cost2 <- 0.05; cost3 <- 50
> DT$S[1] <- cost1
> for (j in 2:(20)) {
+   DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
+ }
> DT$S
 [1]    3.00000  -46.98750  -97.18328 -147.58821 -198.20316 -249.02901 -300.06663 -351.31691
 [9] -402.78073 -454.45898 -506.35256 -558.46236 -610.78929 -663.33424 -716.09814 -769.08188
[17] -822.28639 -875.71258 -929.36138 -983.23372
> s <- 1+cost2/12
> s_powers <- s^(1:(N-1))
> cost1*s_powers - cost3*(1-s_powers)/(1-s)
 [1]  -46.98750  -97.18328 -147.58821 -198.20316 -249.02901 -300.06663 -351.31691 -402.78073
 [9] -454.45898 -506.35256 -558.46236 -610.78929 -663.33424 -716.09814 -769.08188 -822.28639
[17] -875.71258 -929.36138 -983.23372

Let's compare four ways.

f1 <- function(){ # your way
  DT$S[1] <- cost1
  for (j in 2:N) {
    DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
  }
}
f2 <- function(){ # group the two DT$S[j-1] (cause DT$S[j-1] is slow)
  DT$S[1] <- cost1
  for (j in 2:N) {
    DT$S[j] <- (1+cost2/12)*DT$S[j-1]-cost3
  }
}
f3 <- function(){ # avoid DT$S[j-1] (@minem's answer)
  u <- numeric(N)
  u[1] <- cost1
  for (j in 2:N) {
    u[j] <- (1+cost2/12)*u[j-1]-cost3
  }
  DT$S <- u
}
f4 <- function(){ # express DT$S[j] in function of j
  s <- 1+cost2/12
  s_powers <- s^(1:(N-1))
  u2N <- cost1*s_powers - cost3*(1-s_powers)/(1-s)
  DT$S <- c(cost1, u2N)
}

Let's compare:

> library(microbenchmark)
> N <- 2000
> DT <- data.frame(Index=1:N)
> microbenchmark(
+   f1 = f1(),
+   f2 = f2(),
+   f3 = f3(),
+   f4 = f4()
+ )
Unit: microseconds
 expr       min        lq       mean    median         uq        max neval cld
   f1 65802.386 67920.918 73168.4472 69025.145 70347.8050 180938.153   100   c
   f2 52641.373 54790.698 58553.8418 55916.565 57021.0145 163660.112   100  b 
   f3   375.736   396.932   458.5317   418.798   459.6295    974.593   100 a  
   f4   220.890   235.170   266.3977   240.971   259.9360   1318.199   100 a  

The winner is f4, the one which does not use recurrence.

like image 35
Stéphane Laurent Avatar answered Sep 22 '22 05:09

Stéphane Laurent