How can I vectorize the multivariate normal CDF (cumulative density function) in Python?
When looking at this post, I found out that there is a Fortran implementation of the multivariate CDF that was "ported" over to Python. This means I can easily evaluate the CDF for one specific case.
However, I'm having a lot of trouble efficiently applying this function to multiple entries.
Specifically speaking, the function I need to "vectorize" takes 4 arguments:
But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.
Here is some code to illustrate my problem. In the example below I'm just using random data to illustrate my point.
import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
np.random.seed(666)
iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution
lower = np.random.rand(obs,dim)
upper = lower + np.random.rand(obs,dim)
means = np.random.rand(obs,dim)
# Creates a symmetric matrix - used for the random covariance matrices
def make_sym_matrix(dim,vals):
m = np.zeros([dim,dim])
xs,ys = np.triu_indices(dim,k=1)
m[xs,ys] = vals[:-dim]
m[ys,xs] = vals[:-dim]
m[ np.diag_indices(dim) ] = vals[-dim:]
return m
# Generating the random covariance matrices
covs = []
for i in range(obs):
cov_vals = np.random.rand(int((dim**2 + dim)/2))
cov_mtx = make_sym_matrix(dim,cov_vals)
covs.append(cov_mtx)
covs = np.array(covs)
# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
results = []
for j in range(obs):
this_p, this_i = mvnun(lower[j],upper[j],means[j],covs[j])
results.append(this_p)
time_end = time.time()
print(time_end-time_start)
Here I have a dataset with 1500 observations which I'm evaluating 1000 times. In my machine, this takes 6.74399995804 seconds to calculate.
Note that I'm not trying to get rid of the outer for-loop (over i). I just created that to mimic my real problem. The for-loop that I'm really trying to eliminate is the internal one (over j).
The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.
I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2pye as can be seen here.
Can anyone help me out with this? I've started looking at theano, but it seems that the only option I have there is to use the scan function, which might not be much of an improvement either.
Thank you!!!
This is only a partial answer, but there is a way to increase speed if the dimension of the multivariate normal distribution is small (2 or 3) and if the covariance matrix stays the same.
import numpy as np
import openturns as ot
def computeRectangularDomainProbability(lower, upper, means, cov_matrix):
"""
Compute the probability of a rectangular solid
under a multinormal distribution.
"""
# Center the bounds of the rectangular solid on the mean
lower -= means
upper -= means
# The same covariance matrix for all rectangular solids.
cov_matrix = ot.CovarianceMatrix(cov_matrix)
# This way, we only need to define one multivariate normal distribution.
# That is the trick that allows vectorization.
dimension = lower.shape[1]
multinormal = ot.Normal([0.0] * dimension, cov_matrix)
# The probability of the rectangular solid is a weighted sum
# of the CDF of the vertices (with weights equal to 1 or -1).
# The following block computes the CDFs and applies the correct weight.
full_reverse_binary = np.array(list(bin(2**dimension)[:1:-1]), dtype=int)
prob = 0.0
for i in range(2**dimension):
reverse_binary = np.array(list(bin(i)[:1:-1]), dtype=int)
reverse_binary = np.append(reverse_binary,
np.zeros(len(full_reverse_binary) -
len(reverse_binary) -
1)).astype(int)
point = np.zeros(lower.shape)
for num, digit in enumerate(reverse_binary):
if digit:
point[:, num] = upper[:, num]
else:
point[:, num] = lower[:, num]
cdf = np.array(multinormal.computeCDF(point))
if (reverse_binary.sum() % 2) == (dimension % 2):
prob += cdf
else:
prob -= cdf
return prob.reshape(-1,)
iters = 1000 # loop size
obs = 1500 # number of rectangular solids
dim = 2 # dimension of multivariate normal distribution
import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
from sklearn.datasets import make_spd_matrix
import openturns as ot
time_mvnun = 0.0
time_openturns = 0.0
discrepancy = 0.0
np.random.seed(0)
for iteration in range(iters):
lower = np.random.rand(obs,dim)
upper = lower + np.random.rand(obs,dim)
means = np.random.rand(obs,dim)
# Generating the random covariance matrices with sklearn
# to make sure they are positive semi-definite
cov_mtx = make_spd_matrix(dim)
time_start = time.time()
results = []
for j in range(obs):
this_p, this_i = mvnun(lower[j],upper[j],means[j],cov_mtx)
results.append(this_p)
results = np.array(results)
time_end = time.time()
time_mvnun += time_end - time_start
time_start = time.time()
otparallel = computeRectangularDomainProbability(lower, upper, means, cov_mtx)
time_end = time.time()
time_openturns += time_end - time_start
mvnorm_vs_otparallel = np.abs(results - otparallel).sum()
discrepancy += mvnorm_vs_otparallel
print('Dimension {}'.format(dim))
# Print computation time
print('mvnun time: {0:e}'.format(time_mvnun))
print('openturns time: {0:e}'.format(time_openturns))
print('ratio mvnun/ot: {0:f}'.format(time_mvnun / time_openturns))
# Check that the results are the same for mvnum and openturns
print('mvnun-openturns result discrepancy: {0:e}'.format(discrepancy))
Output on my machine:
Dimension 2
mvnun time: 4.040635e+00
openturns time: 3.588211e+00
ratio mvnun/ot: 1.126086
mvnun-openturns result discrepancy: 8.057912e-11
There is a slight speedup: a little more than 10%.
Let us change the global variables controlling the script.
iters = 100 # loop size
obs = 1500 # number of rectangular solids
dim = 3 # dimension of multivariate normal distribution
Output on my machine:
Dimension 3
mvnun time: 2.378337e+01
openturns time: 1.596872e+00
ratio mvnun/ot: 14.893725
mvnun-openturns result discrepancy: 4.537064e-03
The gain is much more significant in dimension 3: the proposed code is 15 times faster.
Unfortunately, openturns slows down a lot with dimension 4. It contains a smart implementation of the CDF for dimensions 1, 2 and 3 but falls back on a slower, more generic implementation for dimensions greater than 3.
iters = 1 # loop size
obs = 15 # number of rectangular solids
dim = 4 # dimension of multivariate normal distribution
Dimension 4
mvnun time: 7.289171e-03
openturns time: 3.689714e+01
ratio mvnun/ot: 0.000198
mvnun-openturns result discrepancy: 6.297527e-07
In dimension 4, the proposed code is slower by about 4 orders of magnitude! This is probably because in dimension 4, it needs to compute 16=2^4 CDFs per rectangular solid, and each of these computations is slower than in smaller dimensions.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With