
Loop calculation with previous value not using for in R

I'm a beginner R programmer. I'm having trouble with a loop calculation in which each value depends on the previous one, like a recursion. An example of my data:

library(data.table)
dt <- data.table(a = 0:4, b = c(0, 1, 2, 1, 3))

The calculated value y is defined as y[n] = (y[n-1] + b[n]) * a[n], with initial value y[1] = 0.

I used a for loop; the code and result are below.

dt$y <- 0
for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
}

   a b  y
1: 0 0  0
2: 1 1  1
3: 2 2  6
4: 3 1 21
5: 4 3 96

This result is what I want. However, my data has over 1,000,000 rows and several columns, so I'm trying to find another way that avoids a for loop. I tried "Reduce()", but I could only get it to work with a single vector (e.g. y[n] = y[n-1] + b[n]). As shown above, my function uses two vectors, a and b, so I can't find a solution.

Is there a faster way to do this without a for loop, such as a recursive function or a function from a package?

Flyest asked May 25 '26 19:05


1 Answer

This kind of computation cannot take advantage of R's vectorization, because each value depends on the one before it. However, the slowdown appears to come mainly from the cost of indexing into a data.frame or data.table inside the loop.

Interestingly, I was able to speed up the loop considerably by accessing a, b, and y directly as numeric vectors (a 1000+ fold speedup for 2*10^5 rows) or as matrix "columns" (a 100+ fold speedup for 2*10^5 rows) rather than as columns of a data.table or data.frame.

This old discussion may still shed some light on this rather surprising result: https://stat.ethz.ch/pipermail/r-help/2011-July/282666.html
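Regarding the Reduce() attempt in the question: one way to handle two input vectors is to iterate over the row indices rather than over a single vector, so that the function can look up both a[i] and b[i]. The following is only a minimal sketch using the question's toy data. Reduce() still steps through the elements one at a time internally, so it should not be expected to beat the plain loop over vectors.

```r
# Toy data from the question, held as plain numeric vectors
a <- c(0, 1, 2, 3, 4)
b <- c(0, 1, 2, 1, 3)

# Iterate over row indices 2..n; `prev` carries y[i - 1].
# `init = 0` plays the role of y[1].
y <- Reduce(
  function(prev, i) (prev + b[i]) * a[i],
  seq_along(a)[-1],
  init = 0,
  accumulate = TRUE
)
print(y)
```

With accumulate = TRUE the result has one entry per row, starting with the initial value.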

Please note that I also made a different toy data.frame, so that I could test a larger example without y overflowing to Inf as i increased. The options I timed were:

Option data.frame (numeric vectors embedded in a data.frame or data.table per your example):

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
  dt$y[i] <- (dt$y[i - 1] + dt$b[i]) * dt$a[i]
})
#user  system elapsed 
#79.39  146.30  225.78
#NOTE: Sorry, I didn't have the patience to let the data.table version finish for vec_length=2*10^5.  
tail(dt$y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674

Option vector (numeric vectors extracted in advance of loop):

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
y <- as.numeric(dt$y)
a <- as.numeric(dt$a)
b <- as.numeric(dt$b)
system.time(for (i in 2:length(y)) {
  y[i] <- (y[i - 1] + b[i]) * a[i]
})
#user  system elapsed 
#0.03    0.00    0.03 
tail(y)
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
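To apply the vector approach to the original data, the loop can be wrapped in a small function and its result assigned back to the table in a single step, so that the data.frame is indexed only once. This is a sketch under assumptions: the helper name recur_y and its y1 argument are hypothetical, and a base data.frame stands in for the question's data.table.

```r
# Hypothetical helper: runs y[i] = (y[i - 1] + b[i]) * a[i] on plain vectors
recur_y <- function(a, b, y1 = 0) {
  n <- length(a)
  if (n == 0) return(numeric(0))
  y <- numeric(n)          # preallocate the result
  y[1] <- y1               # initial value
  for (i in seq_len(n)[-1]) {
    y[i] <- (y[i - 1] + b[i]) * a[i]
  }
  y
}

# Toy data from the question
df <- data.frame(a = 0:4, b = c(0, 1, 2, 1, 3))

# Extract the columns once, compute, and write the whole column back once
df$y <- recur_y(df$a, df$b)
print(df)
```

The loop itself is unchanged; only the per-iteration data.frame indexing is removed.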

Option matrix (data.frame converted to matrix before loop):

vec_length <- 200000
dt <- as.matrix(data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0))
system.time(for (i in 2:nrow(dt)) {
  # columns: 1 = a, 2 = b, 3 = y
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#user  system elapsed 
#0.67    0.01    0.69
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674
#NOTE: a matrix is actually a vector with an additional attribute (its "dim") that says how the "matrix" should be organized into rows and columns

Option data.frame with matrix style indexing:

vec_length <- 200000
dt <- data.frame(a=seq(from=0, to=1, length.out = vec_length), b=seq(from=0, to=-1, length.out = vec_length), y=0)
system.time(for (i in 2:nrow(dt)) {
  dt[i, 3] <- (dt[i - 1, 3] + dt[i, 2]) * dt[i, 1]
})
#user  system elapsed 
#110.69    0.03  112.01 
tail(dt[,3])
#[1] -554.1953 -555.1842 -556.1758 -557.1702 -558.1674 -559.1674

ThetaFC answered May 27 '26 09:05


