How to speed up or vectorize a for loop?

I would like to speed up my for loop through vectorization, data.table, or some other approach. I have to run the code on 1,000,000 rows, and it is really slow.

The code is fairly self-explanatory, but I have included an explanation below just in case, along with the input and output of the function. I hope you can help me make it faster.

My goal is to bin the vector "Volume" so that each bin holds at most 100 shares. The vector "Volume" contains the number of shares traded. Here is what it looks like:

head(Volume, n = 60)
[1]  5  3  1  5  3  1  1  1  1  1  1  1 18  1  1 18  2  7 13  2  7 13  3  2  1  1  3  2  1  1  1
[32]  1  6  6  1  1  1  1  1  1  1  1 18  2  1  1  2  1 14 18  2  1  1  2  1 14  1  1  9  5

The vector "binIdexVector" has the same length as "Volume" and contains the bin number of each element: every element in the first 100 shares gets the number 1, every element in the next 100 shares gets the number 2, every element in the 100 shares after that gets the number 3, and so on. Here is what that vector looks like:

 head(binIdexVector, n = 60)
[1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2
[48] 2 2 3 3 3 3 3 3 3 3 3 3 3
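As a quick check of the rule on the data above, here is a minimal sketch that takes a running total of the first 22 trades (values copied from "Volume"; the names `first22` and `running` are only for illustration). It suggests that a bin is closed as soon as the next trade would push the running total past 100, so a bin may hold fewer than 100 shares.

```r
# First 22 trades, copied from the Volume vector in the question.
first22 <- c(5, 3, 1, 5, 3, 1, 1, 1, 1, 1, 1, 1, 18, 1, 1, 18, 2, 7, 13, 2, 7, 13)

# Running share total across the trades.
running <- cumsum(first22)

running[21]  # 93: still within the 100-share limit, so trade 21 stays in bin 1
running[22]  # 106: over the limit, so trade 22 opens bin 2
```

This agrees with the head of "binIdexVector" shown above, where the first 21 elements are 1 and the 22nd is 2.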

Here is my function:

#input as a vector
Volume<-c(5L, 3L, 1L, 5L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 1L, 1L, 
                   18L, 2L, 7L, 13L, 2L, 7L, 13L, 3L, 2L, 1L, 1L, 3L, 2L, 1L, 1L, 
                   1L, 1L, 6L, 6L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 18L, 2L, 1L, 
                   1L, 2L, 1L, 14L, 18L, 2L, 1L, 1L, 2L, 1L, 14L, 1L, 1L, 9L, 5L, 
                   2L, 1L, 1L, 1L, 1L, 9L, 5L, 2L, 1L, 1L, 1L, 2L, 1L, 1L, 3L, 1L, 
                   1L, 2L, 1L, 2L, 1L, 1L, 3L, 1L, 1L, 2L, 9L, 9L, 3L, 3L, 1L, 1L, 
                   1L, 1L, 5L, 5L, 8L, 8L, 2L, 1L, 2L, 1L, 10L, 10L, 10L, 10L, 10L, 
                   10L, 10L, 10L, 9L, 9L, 1L, 1L, 8L, 1L, 8L, 1L, 8L, 8L, 2L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 
                   1L, 1L, 3L, 3L, 3L, 3L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 5L, 5L, 
                   1L, 2L, 7L, 1L, 2L, 7L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 3L, 
                   1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 2L, 1L, 
                   10L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 10L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 30L, 1L, 2L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
                   1L, 1L, 3L, 1L, 2L, 2L, 1L, 1L, 1L, 1L, 2L, 2L, 1L, 2L, 2L, 2L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 1L, 7L, 7L, 3L, 1L, 1L, 1L, 4L, 3L, 1L, 
                   1L, 1L, 4L, 25L, 1L, 1L, 25L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L, 
                   1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L)

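# preallocate the result: one bin index per element of Volume
binIdexVector <- numeric(length(Volume))

# initialize the current bin number and the running share total
binIdex <- 1
totalVolume <- 0

for (i in seq_along(Volume)) {

  totalVolume <- totalVolume + Volume[i]

  if (totalVolume <= 100) {

    # still within the current bin
    binIdexVector[i] <- binIdex

  } else {

    # limit exceeded: open a new bin and restart the total at this trade
    binIdex <- binIdex + 1
    binIdexVector[i] <- binIdex
    totalVolume <- Volume[i]
  }
}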

# output:
> dput(binIdexVector)
c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
  1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 
  2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 
  3, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 
  4, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 
  6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 7, 
  7, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 8, 
  8, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 
  9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 10, 10, 10, 10, 10, 
  10, 10, 10, 10, 10, 10, 10, 10, 10, 10)

Thanks a lot for your help!

> sessionInfo()
R version 3.1.2 (2014-10-31)
Platform: x86_64-w64-mingw32/x64 (64-bit)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
[1] tools_3.1.2
user3602239 asked Mar 14 '15 21:03


1 Answer

You can use Rcpp when vectorization is difficult.

library(Rcpp)
cppFunction('
  IntegerVector bin(NumericVector Volume, int n) {
    IntegerVector binIdexVector(Volume.size());
    int binIdex = 1;          // current bin number
    double totalVolume = 0;   // running share total within the current bin

    for (int i = 0; i < Volume.size(); i++) {
      totalVolume = totalVolume + Volume[i];
      if (totalVolume <= n) {
        // still within the current bin
        binIdexVector[i] = binIdex;
      } else {
        // limit exceeded: open a new bin and restart the total at this trade
        binIdex++;
        binIdexVector[i] = binIdex;
        totalVolume = Volume[i];
      }
    }
    return binIdexVector;
  }')

all.equal(bin(Volume, 100), binIdexVector)
#[1] TRUE

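It's faster than findInterval(cumsum(Volume), seq(0, sum(Volume), by=100)), which in any case gives an inexact answer: it cuts the cumulative sum at fixed multiples of 100 instead of restarting the running total whenever a bin overflows.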
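To illustrate why the findInterval() shortcut does not reproduce the loop, here is a small sketch in plain R. The function name `bin_loop` and the four made-up volumes are assumptions for illustration only; `bin_loop` is meant to mirror the loop from the question, not the Rcpp function above.

```r
# Reference implementation: the loop from the question wrapped in a function.
# `bin_loop` is a hypothetical name used only for this sketch.
bin_loop <- function(volume, n = 100) {
  bins <- integer(length(volume))
  bin <- 1L
  total <- 0
  for (i in seq_along(volume)) {
    total <- total + volume[i]
    if (total > n) {
      # limit exceeded: open a new bin and restart the total at this trade
      bin <- bin + 1L
      total <- volume[i]
    }
    bins[i] <- bin
  }
  bins
}

vol <- c(60, 50, 60, 50)  # made-up volumes, chosen so every trade overflows

# Loop rule: the running total restarts each time a bin overflows.
loop_bins <- bin_loop(vol, 100)

# Shortcut: the cumulative sum is cut at the fixed breakpoints 0, 100, 200.
fixed_bins <- findInterval(cumsum(vol), seq(0, sum(vol), by = 100))

loop_bins   # 1 2 3 4
fixed_bins  # 1 2 2 3
```

The two results diverge from the third trade on, because the shortcut never resets the total after an overflow. It can be a reasonable approximation when fixed 100-share boundaries are acceptable, but it is not a drop-in replacement for the loop.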

Khashaa answered Oct 02 '22 18:10