Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Optimizing loop on large data file in R, perhaps using Rcpp

I have a loop in R that that's quite slow (but works). Currently, this calculation takes about ~3 minutes on my laptop, and I think it can be improved. Eventually, I'll loop through many data files running calculations based on the results of this code, and I'd like to make the current code faster if possible.

Basically, for each date, for 11 different values of X, the loop grabs the last X years' worth of rainfall values (Y), finds a linear inverse weighting (Z) so that the oldest rainfall values are weighted least, multiples the rain (Y) and weights (Z) to get a vector A, then takes the sum of A as the final result. This is done for thousands of dates.

However, I couldn't think of or find advice for any way to make this faster in R, so I attempted to rewrite it in Rcpp, in which I have limited knowledge of. My Rcpp code does not duplicate the R code exactly, as the resulting matrix is different (wrong) from what it should be (out1 vs out2; I know out1 is correct). It seems like the Rcpp code is faster, but I can only test it using a few columns because it begins crashing (fatal error in RStudio) if I attempt to run all 11 columns (i <= 10).

I'm looking for feedback on how I can improve the R code and/or correct the Rcpp code to provide the correct result and not crash in the process.

(Although the code I've posted below doesn't show it, the data is loaded into R the way it is [as a dataframe] for a few calculations done outside of the code shown. For the specific calculation shown here, only column 2 of the dataframe is used.)

The data file is here: https://drive.google.com/file/d/0Bw_Ca37oxVmJekFBR2t4eDdKeGM/view?usp=sharing

Attempt in R

library(readxl)

library(readxl)
library(Rcpp)
file = data.frame(read_excel("lake.xlsx", trim_ws=T)
col_types=c("date","numeric","numeric","date",rep("numeric",4),"text")))
file[,1] = as.Date(file[,1], "%Y/%m/%d", tz="UTC")
file[,4] = as.Date(file[,4], "%Y/%m/%d", tz="UTC")

rainSUM = function(df){
rainsum = data.frame("6m"=as.numeric(), "1yr"=as.numeric(), "2yr"=as.numeric(), "3yr"=as.numeric(), "4yr"=as.numeric(), "5yr"=as.numeric(), "6yr"=as.numeric(), "7yr"=as.numeric(), "8yr"=as.numeric(), "9yr"=as.numeric(), "10yr"=as.numeric()) # create dataframe for storing the sum of weighted last d values

  Tdays <- length(df[,1])

  for(i in 1:11) {           # loop through the lags
    if (i==1) {
      d <- 183               # 6 month lag only has 183 days,
    } else {
      d <- (i-1)*366         # the rest have 366 days times the number of years
    }
    w <- 0:(d-1)/d

    for(k in 1:Tdays) {      # loop through rows of rain dataframe (k = row)
      if(d>k){               # get number of rain values needed for the lag
        rainsum[k,i] <- sum(df[1:k,2] * w[(d-k+1):d])                                
      } else{
        rainsum[k,i] <- sum(df[(k-d+1):k,2] * w)                          
      }
    }
  }
  return(rainsum)
}

out1 <- rainSUM(file)

Attempt in Rcpp

#include <Rcpp.h>
using namespace Rcpp;

// [[Rcpp::export]]

NumericVector myseq(int first, int last) {  // simulate R's X:Y sequence (step of 1)
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(i);
  return(y);
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer
  NumericVector y(0);
  for (int i=first; i<=last; ++i)
    y.push_back(vec[i]);
  return(y);
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {             // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(1,d);            // sequence 1:d; length d
  NumericVector b = (a-1)/a;               // inverse linear
  return(b);                               // return vector                              
} 

// [[Rcpp::export]]              
NumericMatrix rainsumCPP(DataFrame df, int raincol) {
  NumericVector q(0);
  NumericMatrix rainsum(df.nrows(), 11);       // matrix with number of row days as data file and 11 columns 
  NumericVector p = df( raincol-1 );           // grab rain values (remember C++ first index is 0)
  for(int i = 0; i <= 10; i++) {                // loop through 11 columns (C++ index starts at 0!)
    if (i==0) {
      int d = 183;                             // 366*years lag days 
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what's available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                              // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      }
    }
    else {
      int d = i*366;                           // 183 lag days if column 0
      NumericVector w = weighty(d);            // get weights for this lag series
      for(int k = 0; k < df.nrows(); k++) {    // loop through days (rows)
        if(d>k){                               // if not enough lag days for row, use what's available
          NumericVector m = splicer(p, 0, k);  // subset rain values according to the day being considered
          NumericVector u = splicer(w, (d-k), (d-1));   // same for weight
          m = m*u;                             // multiply rain values by weights
          rainsum(k,i) = sum(m);               // add the sum of the weighted rain to the rainsum matrix
        } else{
          NumericVector m = splicer(p, k-d+1, k);
          m = m*w;
          rainsum(k,i) = sum(m);
        }
      } 
    }
  }
  return(rainsum);
}


/*** R
out2 = rainsumCPP(file, raincol) # raincol currently = 2
  */
like image 898
user8761424 Avatar asked Oct 30 '22 00:10

user8761424


1 Answers

Congratulations! You have an index out of bounds (OOB) error causing an undefined behavior (UB)! You can detect this in the future by changing the vector accessor from [] to () and for the matrix accessor from () to .at().

Switching to these accessors yields:

Error in rainsumCPP(file, 2) : 
  Index out of bounds: [index=182; extent=182].

which indicates an index is out of bounds as the index must always be between 0 and 1 less than the extent (e.g. length of vector - 1).

Preliminary glances indicates that this issue is largely caused by not correctly mapping one-based indexing to zero-based indexing.

Upon playing around with the myseq(), splicer(), and weighty() functions, they do not match their R equivalent given inputs. This can be checked by using all.equal(R_result, Rcpp_Result). This mismatch is in two parts: 1. the bounds of both myseq and splicer and 2. inversion of done inside weighty.

So, by using the following functions that were modified, you should be on a good basis for obtaining the correct results.

// [[Rcpp::export]]
NumericVector myseq(int first, int last) {  // simulate R's X:Y sequence (step of 1)
  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = count;
    count++;
  }

  return y;
}

// [[Rcpp::export]]   
NumericVector splicer(NumericVector vec, int first, int last) {  // splicer

  int vec_len = abs(last - first);

  NumericVector y = no_init(vec_len);

  int count = 0;
  for (int i = first; i < last; ++i) {
    y(count) = vec(i);

    count++;
  }

  return y;
}

// [[Rcpp::export]]              
NumericVector weighty(int d) {       // calculate inverse linear weight according to the number of days in lag 
  NumericVector a = myseq(0, d - 1); // (fixed) sequence 1:d; length d
  NumericVector b = a / d;           // (fixed) inverse linear
  return(b);                         // return vector                              
}

From there, you will likely need to modify the rainsumCpp as no output was given for what the R equivalent was.

like image 196
coatless Avatar answered Nov 15 '22 07:11

coatless