
R Programming: Using previously calculated row to update each row

I have a very large time series and need to create a new series from it, based on an arbitrary starting value and the change in each period. In the real dataset the change depends on other variables in the data frame, but for the purposes of an MWE I recreate it as follows:

initial_value <- 100
set.seed(123)
library(data.table)
df <- as.data.table(data.frame(num = c(1:10),change = rnorm(10)))

The new variable value is defined as its own value in the previous period plus the change in the current period. The value in the first observation is determined by an arbitrarily chosen initial_value. If there were no restrictions on value, it could be created simply as

df <- df[, value0 := initial_value + cumsum(change)]
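To make the unrestricted case concrete, here is a minimal sketch showing that the recursion "previous value plus current change" unrolls to initial_value plus a running sum. The names value_loop and value_cumsum are illustrative and not part of the original code.

```r
# Unrestricted case: value[i] = value[i - 1] + change[i], starting from initial_value.
initial_value <- 100
set.seed(123)
change <- rnorm(10)

# Explicit recursion, written as a loop over a plain numeric vector.
value_loop <- numeric(length(change))
value_loop[1] <- initial_value + change[1]
for (i in 2:length(change)) {
  value_loop[i] <- value_loop[i - 1] + change[i]
}

# Closed form: the recursion unrolls to initial_value plus the running sum of change.
value_cumsum <- initial_value + cumsum(change)

all.equal(value_loop, value_cumsum)
```

Both vectors should agree up to floating-point tolerance, which is why cumsum is enough as long as no period depends on the level reached so far.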

This is very fast using data.table. Unfortunately, change may also depend on the actual value in the previous period. Specifically, assume that whenever value reaches 102, the series must return to initial_value in the next period and stay there for 3 periods. Thus, in the following data frame, I need to create the variable value, whereas the code above produces value0:

    num      change    value0     value
 1:   1 -0.56047565  99.43952  99.43952
 2:   2 -0.23017749  99.20935  99.20935
 3:   3  1.55870831 100.76806 100.76806
 4:   4  0.07050839 100.83856 100.83856
 5:   5  0.12928774 100.96785 100.96785
 6:   6  1.71506499 102.68292 102.68292
 7:   7  0.46091621 103.14383 100.00000
 8:   8 -1.26506123 101.87877 100.00000
 9:   9 -0.68685285 101.19192 100.00000
10:  10 -0.44566197 100.74626  99.55434
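The table also hints at why a single cumsum cannot produce value. The sketch below assumes the threshold of 102 from the description; threshold and first_hit are illustrative names.

```r
initial_value <- 100
threshold <- 102  # the reset level from the description above (assumed name)
set.seed(123)
change <- rnorm(10)

# Unrestricted path, identical to value0 in the table.
value0 <- initial_value + cumsum(change)

# The first period at or above the threshold can be located without a loop...
first_hit <- which(value0 >= threshold)[1]
first_hit

# ...but from the next period on, value restarts at initial_value, so value0
# no longer describes the series and later crossings cannot be read off it.
value0[first_hit + 1]
```

With the seed used in the question, first_hit corresponds to row 6 of the table. Everything after that row depends on the reset, which is what makes the problem sequential.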

So far the only way I managed to produce this result is using a loop:

df$value <- NA 
df$value[1] <- initial_value + df$change[1]
for (i in 2:nrow(df)) {
  if (is.na(df$value[i])) {
    if (df$value[i-1] < 102) {
      df$value[i] <- df$value[i-1] + df$change[i]
    } else {
      df$value[i:min(i+2, nrow(df))] <- initial_value
    } 
  }
}

However, looping over (dozens of) millions of observations is extremely slow. Is there a way to possibly vectorize it or simply run the process more efficiently?

Radek Janhuba asked Sep 27 '17 07:09


1 Answer

I suggest using Rcpp for simple loops like this one; it is easy to replicate the requested logic.
Your loop, wrapped in a function:

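fun_r <- function(){
  # Relies on the global df and initial_value defined in the question
  df$value <- NA 
  df$value[1] <- initial_value + df$change[1]
  for (i in 2:nrow(df)) {
    if (is.na(df$value[i])) {
      if (df$value[i-1] < 102) {
        df$value[i] <- df$value[i-1] + df$change[i]
      } else {
        # Clamp the 3-period reset so it cannot run past the last row
        df$value[i:min(i+2, nrow(df))] <- initial_value
      } 
    }
  }
  df$value  # return only the new column, since the caller assigns it to df$value
}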

The same function in C++:

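library(Rcpp)
cppFunction('
  NumericVector fun_c(NumericVector change, double init, double thr) {
    int n = change.size();
    NumericVector out(n);
    if (n == 0) return out;  // nothing to compute for empty input

    out[0] = init + change[0];

    for (int i = 1; i < n; i++) {
      if (out[i - 1] < thr) {
        // Normal step: previous value plus the current change.
        out[i] = out[i - 1] + change[i];
      } else {
        // Threshold reached: hold init for up to 3 periods, clamped to n - 1.
        int end = std::min(i + 2, n - 1);
        for (int j = i; j <= end; j++) {
          out[j] = init;
        }
        i = end;  // skip the periods just filled; the loop increment moves past them
      }
    }
    return out;
  }
')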

UPDATE: The R function above is based on data.frame subsetting, which is a highly inefficient way to work with data in R, so it is an underdog expected to lose every benchmark. When looping, one should operate on plain vectors (or matrices) rather than on data frame columns. Below is a function that is more competitive with the Rcpp example:

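fun_r2 <- function(change, initial_value, thr) {
  n <- length(change)
  value <- rep(NA_real_, n)  # NA marks periods that have not been filled yet
  value[1] <- initial_value + change[1]

  for (i in seq_len(n)[-1]) {
    if (is.na(value[i])) {
      if (value[i - 1] < thr) {
        value[i] <- value[i - 1] + change[i]
      } else {
        # Hold initial_value for up to 3 periods without running past n
        value[i:min(i + 2, n)] <- initial_value
      }
    }
  }
  value
}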
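One detail worth checking in any pure R version is what happens when the threshold is reached within the last two periods. The C++ function guards against this with std::min(i + 2, n - 1). A minimal sketch of the R analogue follows; the toy vector and the index i are purely illustrative.

```r
# In R, assigning past the end of a vector silently extends it.
value <- numeric(5)
i <- 4
value[i:(i + 2)] <- 100
length(value)  # 6: one element longer than intended

# Clamping the upper index with min() keeps the length fixed.
n <- 5
value_clamped <- numeric(n)
value_clamped[i:min(i + 2, n)] <- 100
length(value_clamped)  # 5
```

On a data frame column the unclamped assignment is worse than a silent extension, because the longer vector no longer matches the number of rows.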

All three functions produce the same results. fun_c is the fastest, but the vector-based fun_r2 is acceptably fast.

df$value <- fun_r()
df$value_r2 <- fun_r2(as.vector(df$change), initial_value=100, thr=102)
df$value_rcpp <- fun_c(df$change, init=100, thr=102)

all.equal(df$value, df$value_rcpp)
all.equal(df$value, df$value_r2)
# TRUE
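As an extra check that does not require Rcpp, the result can also be compared with the value column printed in the question. This is only a sketch: reset_walk, hold, and expected are hypothetical names introduced here, and the expected numbers are copied from the question's table (rounded to five decimals, hence the loose tolerance).

```r
# Hypothetical helper: the same reset rule, applied to a plain numeric vector.
reset_walk <- function(change, init, thr, hold = 3L) {
  n <- length(change)
  value <- rep(NA_real_, n)
  if (n == 0L) return(value)
  value[1] <- init + change[1]
  i <- 2L
  while (i <= n) {
    if (value[i - 1L] < thr) {
      # Normal step: previous value plus the current change.
      value[i] <- value[i - 1L] + change[i]
      i <- i + 1L
    } else {
      # Threshold reached: hold init for `hold` periods, clamped to n.
      end <- min(i + hold - 1L, n)
      value[i:end] <- init
      i <- end + 1L
    }
  }
  value
}

set.seed(123)
change <- rnorm(10)
result <- reset_walk(change, init = 100, thr = 102)

# Values copied from the `value` column of the question's table (5 decimals).
expected <- c(99.43952, 99.20935, 100.76806, 100.83856, 100.96785,
              102.68292, 100, 100, 100, 99.55434)

all.equal(result, expected, tolerance = 1e-6)
```

Comparing against the printed table guards against the case where all implementations agree with each other but share the same mistake.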

mb <- microbenchmark::microbenchmark(
  fun_r(),
  fun_r2(as.vector(df$change), initial_value=100, thr=102),
  fun_c(df$change, init=100, thr=102),
  times=100L
)

#    expr       mean
# 1 fun_r()   6650.72481
# 2 fun_r2()  42.28442
# 3 fun_c()   18.24121

Enjoy!

GoGonzo answered Nov 10 '22 11:11