Rapidly generating ~ 10^9 steps of a random process in R

I have the following task to perform:

Generate 10^9 steps of the process described by the formula:

X(0)=0
X(t+1)=X(t)+Y(t)

where the Y(t) are independent random variables with distribution N(0,1). Calculate the percentage of indices t for which the value of X(t) was negative.

I tried the following code:

  x<-c(0,0)
  z<-0
  loop<-10^9
  for(i in 2:loop) {
    x[1]<-x[2]
    x[2]<-x[1]+rnorm(1, 0, 1)
    if (x[2]<0) {z<-z+1}
  }

However, it is very slow. How can I speed it up?

asked Dec 23 '17 by wojciesz

3 Answers

In general, for problems like this, you can translate your function one-to-one into C++ using the Rcpp package. This should give a considerable speedup.

First, the R version:

random_sum <- function(loop = 1000) {
  x<-c(0,0)
  z<-0
  for(i in 2:loop) {
    x[1]<-x[2]
    x[2]<-x[1]+rnorm(1, 0, 1)
    if (x[2]<0) {z<-z+1}
  }
  z / loop
}
set.seed(123)
random_sum()
# [1] 0.134

Now the C++ version:

library("Rcpp")
cppFunction("
  double random_sum_cpp(unsigned long loop = 1000) {
    double x1 = 0;
    double x2 = 0;
    double z = 0;
    for (unsigned long i = 2; i < loop; i++) {
      x1 = x2;
      x2 = x1 + Rcpp::rnorm(1)[0];
      if (x2 < 0) z = z+1;
    }
    return z/loop;
  }")

set.seed(123)
random_sum_cpp()
# [1] 0.134
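
(A side note, not part of the original answer: R::rnorm draws a single value directly, which avoids the temporary length-1 vector that Rcpp::rnorm(1)[0] allocates on every iteration. A minimal sketch with the same loop structure, under the hypothetical name random_sum_cpp_scalar:)

library("Rcpp")
cppFunction("
  double random_sum_cpp_scalar(unsigned long loop = 1000) {
    double x1 = 0;
    double x2 = 0;
    double z = 0;
    for (unsigned long i = 2; i < loop; i++) {
      x1 = x2;
      x2 = x1 + R::rnorm(0.0, 1.0); // scalar draw, no temporary vector
      if (x2 < 0) z = z + 1;
    }
    return z/loop;
  }")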

For completeness, let's also consider the vectorized version that was proposed:

random_sum_vector <- function(loop = 1000) {
  Y = rnorm(loop)
  sum(cumsum(Y)<0)/loop
}
set.seed(123)
random_sum_vector()
# [1] 0.134

We see that it gives the same result for the same random seed, so it seems a viable contender.
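
As a quick sanity check (my addition, not in the original answer), the cumulative sum reproduces the recursion X(t+1) = X(t) + Y(t) with X(0) = 0 exactly, which is why the vectorized form is valid:

# build the walk with the explicit recursion and compare to cumsum
set.seed(1)
Y <- rnorm(10)
X_loop <- Reduce(`+`, Y, accumulate = TRUE)
all.equal(X_loop, cumsum(Y)) # TRUE, up to floating-point tolerance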

In the benchmark, the C++ version and the vectorized version perform similarly, with the vectorized version displaying a slight edge over the C++ version:

> microbenchmark(random_sum(100000),
                 random_sum_vector(100000),
                 random_sum_cpp(100000))
Unit: milliseconds
                     expr        min         lq       mean     median         uq       max neval
        random_sum(1e+05) 184.205588 199.859266 209.220232 205.137043 211.026740 274.47615   100
 random_sum_vector(1e+05)   6.320690   6.631704   7.273645   6.799093   7.334733  18.48649   100
    random_sum_cpp(1e+05)   8.950091   9.362303  10.663295   9.956996  11.079513  21.30898   100

However, the vectorized version trades memory for speed and will blow up your memory for long walks. The C++ version uses virtually no memory.
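
A rough back-of-the-envelope estimate (my addition, assuming 8-byte doubles) shows why: rnorm(loop) and cumsum(Y) each allocate a full-length double vector.

steps <- 1e9
bytes_per_double <- 8
# Y and cumsum(Y) are each a vector of `steps` doubles, ignoring temporaries
2 * steps * bytes_per_double / 1024^3
# ~14.9 GiB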

For 10^9 steps, the C++ version runs in about 2 minutes (110 seconds) on my machine. I didn't try the R version. Based on the shorter benchmarks, it would probably take around 7 hours.

> microbenchmark(random_sum_cpp(10^9), times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 random_sum_cpp(10^9) 110.2182 110.2182 110.2182 110.2182 110.2182 110.2182     1

answered by Claus Wilke

This should be much faster, but a billion of anything may take a while. It might be good to test this out first with smaller values of length, such as 10^6.

length = 10^9
Y = rnorm(length)
sum(cumsum(Y)<0)/length

EDIT

Based on the comments of @user3666197, I tested this and he was correct. This solution works well for smaller numbers, but once the number of steps gets too large, it fails.

I tested my "vectorized" version against the OP's code. When the length of the random walk was 10^8, my code took about 7 seconds and the OP's code took 131 seconds (on my laptop). However, when I increased the length to 10^9 (as per the original question), my version caused a lot of disk swapping and I had to kill the process. This solution fails at the scale requested by the OP.

answered by G5W

One solution is to go with the vectorized approach proposed by @G5W, but break it into smaller chunks to avoid memory overflow issues. This gives you the speed of the vectorized solution, and by managing the chunk size you can control how much memory the process uses.

The following breaks the problem into blocks of 1e+07, and by looping 100 times you get a total of 1e+09.

At the end of each block, you record the percentage of time spent below 0 and the walk's ending point. That ending point is then fed into the next block as its starting offset, and the process repeats.

At the end, average the 100 runs to get the overall fraction of time below zero. The call to cat in the while loop is there to monitor progress; it can be commented out.

funky <- function(start, length = 1e+07) {
  Y <- rnorm(length)
  Z <- cumsum(Y)
  # return the fraction of this block spent below zero and the walk's new end point
  c(sum(Z < (-start))/length, (tail(Z, 1) + start))
}

starttime <- Sys.time()
resvect <- vector(mode = "numeric", length = 100)
result <- funky(0)
resvect[1] <- result[1]
i <- 2
while (i < 101) {
  cat(result, "\n")
  result <- funky(result[2])
  resvect[i] <- result[1]
  i <- i + 1
}
mean(resvect)
# [1] 0.1880392
endtime <- Sys.time()
elapsed <- endtime - starttime
elapsed
# Time difference of 1.207566 mins
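
For reference, here is a minimal sketch (not from the original answer; total_steps and block_size are illustrative names) that wraps the same chunking idea into a single function with an adjustable block size:

chunked_walk <- function(total_steps = 1e9, block_size = 1e7) {
  n_blocks <- total_steps / block_size # assumes total_steps is a multiple of block_size
  frac_negative <- numeric(n_blocks)
  offset <- 0 # running end point of the walk so far
  for (b in seq_len(n_blocks)) {
    Z <- offset + cumsum(rnorm(block_size))
    frac_negative[b] <- mean(Z < 0)
    offset <- Z[block_size] # feed the end point into the next block
  }
  mean(frac_negative) # overall fraction of time spent below zero
}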

answered by Eric Watt