I have the following task to perform:
Generate 10^9 steps of the process described by the formula:

X(0) = 0; X(t+1) = X(t) + Y(t)

where Y(t) are independent random variables with the distribution N(0, 1). Calculate in what percentage of indices t the value of X(t) was negative.
I tried the following code:
x <- c(0, 0)
z <- 0
loop <- 10^9
for (i in 2:loop) {
  x[1] <- x[2]
  x[2] <- x[1] + rnorm(1, 0, 1)
  if (x[2] < 0) {z <- z + 1}
}
However, it is very slow. How can I speed it up?
In general, for problems like this, you can translate your function one-to-one into C++ using the Rcpp package. This should give a considerable speedup.
First, the R version:
random_sum <- function(loop = 1000) {
  x <- c(0, 0)
  z <- 0
  for (i in 2:loop) {
    x[1] <- x[2]
    x[2] <- x[1] + rnorm(1, 0, 1)
    if (x[2] < 0) {z <- z + 1}
  }
  z / loop
}
set.seed(123)
random_sum()
# [1] 0.134
Now the C++ version:
library("Rcpp")
cppFunction("
double random_sum_cpp(unsigned long loop = 1000) {
double x1 = 0;
double x2 = 0;
double z = 0;
for (unsigned long i = 2; i < loop; i++) {
x1 = x2;
x2 = x1 + Rcpp::rnorm(1)[0];
if (x2 < 0) z = z+1;
}
return z/loop;
}")
set.seed(123)
random_sum_cpp()
# [1] 0.134
For completeness, let's also consider the vectorized version that was proposed:
random_sum_vector <- function(loop = 1000) {
  Y <- rnorm(loop)
  sum(cumsum(Y) < 0) / loop
}
set.seed(123)
random_sum_vector()
# [1] 0.134
We see that it gives the same result for the same random seed, so it seems a viable contender. (All three versions draw from R's RNG stream; Rcpp::rnorm uses R's generator, which is why set.seed makes the results comparable.)
In the benchmark, the C++ version and the vectorized version perform similarly, with the vectorized version displaying a slight edge over the C++ version:
> microbenchmark(random_sum(100000),
+                random_sum_vector(100000),
+                random_sum_cpp(100000))
Unit: milliseconds
                      expr        min         lq       mean     median         uq       max neval
        random_sum(1e+05) 184.205588 199.859266 209.220232 205.137043 211.026740 274.47615   100
 random_sum_vector(1e+05)   6.320690   6.631704   7.273645   6.799093   7.334733  18.48649   100
    random_sum_cpp(1e+05)   8.950091   9.362303  10.663295   9.956996  11.079513  21.30898   100
However, the vectorized version trades memory for speed: it has to hold the entire walk in RAM at once, so it will blow up your memory for long loops. The C++ version uses virtually no memory.
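To see why, here is a back-of-the-envelope estimate (assuming the standard 8 bytes per element of an R numeric vector):

loop <- 10^9
loop * 8 / 2^30   # ~7.45 GiB just for Y
# cumsum(Y) allocates a second numeric vector of the same size, and the
# logical vector from the comparison adds roughly 4 GiB more, so the peak
# footprint of the vectorized version is on the order of 19 GiB for 10^9 steps.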
For 10^9 steps, the C++ version runs in just under two minutes (110 seconds) on my machine. I didn't try the R version; extrapolating linearly from the benchmark above, it would take roughly half an hour.
> microbenchmark(random_sum_cpp(10^9), times = 1)
Unit: seconds
                 expr      min       lq     mean   median       uq      max neval
 random_sum_cpp(10^9) 110.2182 110.2182 110.2182 110.2182 110.2182 110.2182     1
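The half-hour estimate for the pure-R loop is just a linear extrapolation from the median in the first benchmark, since the loop does a constant amount of work per step:

0.205137 * (10^9 / 10^5) / 60
# [1] 34.1895   # roughly 34 minutes for 10^9 steps in pure R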
This should be much faster, but a billion of anything may take a while. It might be good to test this out first with a smaller value of length, like 10^6.
length <- 10^9
Y <- rnorm(length)
sum(cumsum(Y) < 0) / length
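For example, a quick sanity check at 10^6 steps (timed via system.time; the exact numbers will vary by machine):

length <- 10^6
system.time({
  Y <- rnorm(length)
  print(sum(cumsum(Y) < 0) / length)
})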
Based on the comments of @user3666197, I tested this, and he was correct: the solution works well for smaller numbers of steps but fails once the number gets too large. I tested my vectorized version against the OP's code. When the length of the random walk was 10^8, my code took about 7 seconds and the OP's code took 131 seconds (on my laptop). However, when I increased the length to 10^9 (as in the original question), my version caused heavy disk swapping and I had to kill the process. This solution fails at the scale requested by the OP.
One solution is to go with the vectorized approach proposed by @G5W, but to break it into smaller chunks to avoid memory overflow. This gives you the speed of the vectorized solution, and by managing the chunk size you can control how much memory the process uses.
The following breaks the problem into 100 blocks of 1e+07 steps each, for a total of 1e+09.
At the end of the first block, you record the fraction of time spent below 0 and the ending point of the walk. The ending point is then fed to the next block as its starting offset, and so on.
At the end, average the 100 per-block fractions to get the overall fraction of time below zero (the blocks all have the same length, so their mean equals the overall fraction). The calls to cat in the while loop monitor progress; they can be commented out.
funky <- function(start, length = 1e+07) {
  Y <- rnorm(length)
  Z <- cumsum(Y)   # the walk within this block, relative to its start
  # start + Z < 0  <=>  Z < -start; also return the absolute end point,
  # which becomes the starting offset of the next block
  c(sum(Z < (-start)) / length, tail(Z, 1) + start)
}
starttime <- Sys.time()
resvect <- vector(mode = "numeric", length = 100)
result <- funky(0)
resvect[1] <- result[1]
i <- 2
while (i < 101) {
  cat(result, "\n")
  result <- funky(result[2])
  resvect[i] <- result[1]
  i <- i + 1
}
mean(resvect)
# [1] 0.1880392
endtime <- Sys.time()
elapsed <- endtime - starttime
elapsed
# Time difference of 1.207566 mins
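For reuse, the same chunked logic can be wrapped into a single function. A minimal sketch, assuming funky() as defined above (the name random_sum_chunked and its arguments are mine, not part of the answer):

random_sum_chunked <- function(chunks = 100, chunk_len = 1e+07, verbose = FALSE) {
  fractions <- numeric(chunks)   # per-block fraction of time below zero
  start <- 0                     # the walk starts at X(0) = 0
  for (i in seq_len(chunks)) {
    res <- funky(start, chunk_len)
    fractions[i] <- res[1]       # fraction below zero in this block
    start <- res[2]              # end point seeds the next block
    if (verbose) cat("chunk", i, "fraction:", res[1], "\n")
  }
  mean(fractions)                # equal-length blocks, so the mean is exact
}

set.seed(123)
random_sum_chunked()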