Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fastest way to compute the cdf of the Normal distribution over vectors - R::pnorm vs erfc vs?

I hope my reworded question now fits the criteria of Stackoverflow. Please consider the example below. I am writing a Log-Likelihood function in which computing the cdf over vectors is the most time consuming part. Example 1 uses the R::pnorm, Example 2 approximates the normal cdf with erfc. As you can see the results are sufficiently similar, the ercf version is a bit faster.

In practice (within an MLE) however it turns out that the ercf is not as precise, which lets the algorithm run into inf areas unless one sets the constraints accurately. My questions:

1) Am I missing something? Is it necessary to implement some error handling (for the erfc)?

2) Do you have any other suggestions to speed up the code, or alternatives? Does it pay off to look into parallelizing the for-loop?

require(Rcpp)
require(RcppArmadillo)
require(microbenchmark)

#Example 1 : standard R::pnorm
src1 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const     arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
   res(i) = R::pnorm(x(i),mu(i),sigma(i),lt,lg);
}
return wrap(res);
}
'

#Example 2: approximation with ercf
src2 <- '
NumericVector ppnorm(const arma::vec& x,const arma::vec& mu,const    arma::vec& sigma, int lt, int lg) {
int n = x.size();
arma::vec res(n);
for (int i=0; i<n; i++) {
res(i) = 0.5 * erfc(-(x(i) - mu(i))/sigma(i) * M_SQRT1_2);
}
if (lt==0 & lg==0) {
   return wrap(1 - res);
}
if (lt==1 & lg==0) {
   return wrap(res);
}
if (lt==0 & lg==1) {
   return wrap(log(1 - res));
}
if (lt==1 & lg==1) {
   return wrap(log(res));
}
}
'

#some random numbers
xex  = rnorm(100,5,4)
muex = rnorm(100,3,1)
siex = rnorm(100,0.8,0.3)

#compile c++ functions 
func1 = cppFunction(depends = "RcppArmadillo",code=src1) #R::pnorm
func2 = cppFunction(depends = "RcppArmadillo",code=src2) #ercf

#run with exemplaric data
res1 = func1(xex,muex,siex,1,0)
res2 = func2(xex,muex,siex,1,0)

# sum of squared errors
sum((res1 - res2)^2,na.rm=T)
# 6.474419e-32 ... very small

#benchmarking
 microbenchmark(func1(xex,muex,siex,1,0),func2(xex,muex,siex,1,0),times=10000)
#Unit: microseconds
#expr    min      lq     mean median     uq     max neval
#func1(xex, muex, siex, 1, 0) 11.225 11.9725 13.72518 12.460 13.617 103.654 10000
#func2(xex, muex, siex, 1, 0)  8.360  9.1410 10.62114  9.669 10.769 205.784 10000
#my machine: Ubuntu 14.04 LTS, i7 2640M 2.8 Ghz x 4, 8GB memory, RRO 3.2.0 based on version R 3.2.0
like image 541
chameau13 Avatar asked May 19 '15 15:05

chameau13


People also ask

What is the difference between Pnorm and Dnorm in R?

dnorm gives the density, pnorm gives the distribution function, qnorm gives the quantile function, and rnorm generates random deviates.

What does the Pnorm () function do?

The pnorm function provides the cumulative density of the normal distribution at a specific quantile. The qnorm function provides the quantile of the normal distribution at a specified cumulative density.

How do you find the normal distribution of CDF?

We usually denote the standard normal CDF by Φ. The CDF of the standard normal distribution is denoted by the Φ function: Φ(x)=P(Z≤x)=1√2π∫x−∞exp{−u22}du. As we will see in a moment, the CDF of any normal random variable can be written in terms of the Φ function, so the Φ function is widely used in probability.

What does Pnorm mean in R?

pnorm function This function returns the value of the cumulative density function (cdf) of the normal distribution given a certain random variable q, a population mean μ, and the population standard deviation σ. Syntax: pnorm(q, mean, sd,lower.tail) Parameters: q: It is a vector of quantiles.


1 Answers

1) Well, you really should use R's pnorm() as your 0-th example. You don't, you use the Rcpp interface to it. R's pnorm() is already nicely vectorized R-internally (i.e. on C level) so may well be comparative or even faster than Rcpp. Also it does have the advantage to cover cases of NA, NaN, Inf, etc..

2) If you are talking about MLE, and you are concerned about speed and accuracy, you almost surely should rather work with the logarithms, and maybe not withpnorm() but rather dnorm() ?

like image 89
Martin Mächler Avatar answered Sep 21 '22 11:09

Martin Mächler