I have to speed up my script. I have some cycles like:
DT <- data.frame(Index=1:20, A=c(10:29))
cost1 <- 3
cost2 <- 0.05
cost3 <- 50
DT$S[1] <- cost1
for (j in 2:(20)) {
DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
}
Where cost1 and cost2 are constants. Is it possible to avoid writing a cycle?
Loops are slower in R than in C++ because R is an interpreted language (not compiled), even if now there is just-in-time (JIT) compilation in R (>= 3.4) that makes R loops faster (yet, still not as fast). Then, R loops are not that bad if you don't use too many iterations (let's say not more than 100,000 iterations).
Indeed, R for loops are inefficient, especially if you use them wrong. Searching for why R loops are slow discovers that many users are wondering about this question. Below, I summarize my experience and online discussions regarding this issue by providing some trivial code examples.
The main problem with your approach is that you are repeatedly calling elements of data.frame (DT$S
), but that is not needed in this calculations. If we replace that with vector and add the results to data.frame at the end, it is much faster. Also we can simplify the formula.
n <- 1e4
DT <- data.frame(Index = 1:n, A = seq(10, by = 1, length.out = n))
cost1 <- 3
cost2 <- 0.05
cost3 <- 50
your <- function() {
DT$S[1] <- cost1
for (j in 2:(n)) {
DT$S[j] <- DT$S[j - 1] - cost3 + DT$S[j - 1]*cost2/12
}
}
your()
My function:
my <- function() {
cc <- (1 + cost2/12)
r <- vector('numeric', length = n)
r[1] <- cost1
for (j in 2:(n)) {
# r[j] <- r[j - 1] - cost3 + r[j - 1] * cost2/12
r[j] <- r[j - 1] * cc - cost3
}
r
}
DT$S2 <- my()
all.equal(DT$S, DT$S2)
# [1] TRUE
microbenchmark::microbenchmark(your(), my(), times = 2)
# Unit: milliseconds
# expr min lq mean median uq max neval cld
# your() 487.229621 487.229621 490.86917 490.86917 494.508715 494.508715 2 b
# my() 1.515178 1.515178 1.59408 1.59408 1.672982 1.672982 2 a
Your column S
is defined by a first-order linear recurrence. The i-th term can be expressed in function of i
, see e.g. these slides.
> DT <- data.frame(Index=1:20)
> cost1 <- 3; cost2 <- 0.05; cost3 <- 50
> DT$S[1] <- cost1
> for (j in 2:(20)) {
+ DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
+ }
> DT$S
[1] 3.00000 -46.98750 -97.18328 -147.58821 -198.20316 -249.02901 -300.06663 -351.31691
[9] -402.78073 -454.45898 -506.35256 -558.46236 -610.78929 -663.33424 -716.09814 -769.08188
[17] -822.28639 -875.71258 -929.36138 -983.23372
> s <- 1+cost2/12
> s_powers <- s^(1:(N-1))
> cost1*s_powers - cost3*(1-s_powers)/(1-s)
[1] -46.98750 -97.18328 -147.58821 -198.20316 -249.02901 -300.06663 -351.31691 -402.78073
[9] -454.45898 -506.35256 -558.46236 -610.78929 -663.33424 -716.09814 -769.08188 -822.28639
[17] -875.71258 -929.36138 -983.23372
Let's compare four ways.
f1 <- function(){ # your way
DT$S[1] <- cost1
for (j in 2:N) {
DT$S[j] <- DT$S[j-1]-cost3+DT$S[j-1]*cost2/12
}
}
f2 <- function(){ # group the two DT$S[j-1] (cause DT$S[j-1] is slow)
DT$S[1] <- cost1
for (j in 2:N) {
DT$S[j] <- (1+cost2/12)*DT$S[j-1]-cost3
}
}
f3 <- function(){ # avoid DT$S[j-1] (@minem's answer)
u <- numeric(N)
u[1] <- cost1
for (j in 2:N) {
u[j] <- (1+cost2/12)*u[j-1]-cost3
}
DT$S <- u
}
f4 <- function(){ # express DT$S[j] in function of j
s <- 1+cost2/12
s_powers <- s^(1:(N-1))
u2N <- cost1*s_powers - cost3*(1-s_powers)/(1-s)
DT$S <- c(cost1, u2N)
}
Let's compare:
> library(microbenchmark)
> N <- 2000
> DT <- data.frame(Index=1:N)
> microbenchmark(
+ f1 = f1(),
+ f2 = f2(),
+ f3 = f3(),
+ f4 = f4()
+ )
Unit: microseconds
expr min lq mean median uq max neval cld
f1 65802.386 67920.918 73168.4472 69025.145 70347.8050 180938.153 100 c
f2 52641.373 54790.698 58553.8418 55916.565 57021.0145 163660.112 100 b
f3 375.736 396.932 458.5317 418.798 459.6295 974.593 100 a
f4 220.890 235.170 266.3977 240.971 259.9360 1318.199 100 a
The winner is f4
, the one which does not use recurrence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With