
Vectorizing the multivariate normal CDF (cumulative distribution function) in Python

How can I vectorize the multivariate normal CDF (cumulative distribution function) in Python?

When looking at this post, I found out that there is a Fortran implementation of the multivariate normal CDF that was "ported" over to Python. This means I can easily evaluate the CDF for one specific case.
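
For reference, here is a minimal sketch of a single evaluation (the numbers are illustrative; note that scipy.stats.mvn is a private, undocumented module, so the import may not work on every SciPy version):

import numpy as np
from scipy.stats.mvn import mvnun

# P(0 < X1 < 1, 0 < X2 < 1) for two independent standard normals
lower = np.array([0.0, 0.0])  # lower bounds of integration
upper = np.array([1.0, 1.0])  # upper bounds of integration
means = np.array([0.0, 0.0])  # means of the normal random variables
cov = np.eye(2)               # covariance matrix

p, inform = mvnun(lower, upper, means, cov)  # returns (probability, error indicator)
print(p)  # ~0.3413**2 = 0.1165, since the two components are independent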

However, I'm having a lot of trouble efficiently applying this function to multiple entries.

Specifically, the function I need to "vectorize" takes the same 4 arguments as in the snippet above:

  • the lower bounds of integration (vector)
  • the upper bounds of integration (vector)
  • the means of the normal random variables (vector)
  • the covariance matrix of the normal random variables (matrix)

But I'm trying to efficiently evaluate this function over a list of 1000+ elements MANY times.

Here is some code to illustrate my problem. In the example below I'm just using random data to illustrate my point.

import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF

np.random.seed(666)

iters = 1000 # number of times the whole dataset will be evaluated
obs = 1500 # number of elements in the dataset
dim = 2 # dimension of multivariate normal distribution

lower = np.random.rand(obs,dim)
upper = lower + np.random.rand(obs,dim)
means = np.random.rand(obs,dim)

# Creates a symmetric matrix - used for the random covariance matrices
def make_sym_matrix(dim,vals):
    m = np.zeros([dim,dim])
    xs,ys = np.triu_indices(dim,k=1)
    m[xs,ys] = vals[:-dim]
    m[ys,xs] = vals[:-dim]
    m[ np.diag_indices(dim) ] = vals[-dim:]
    return m

# Generating the random covariance matrices
covs = []
for i in range(obs):
    cov_vals = np.random.rand(int((dim**2 + dim)/2))
    cov_mtx = make_sym_matrix(dim,cov_vals)
    covs.append(cov_mtx)
covs = np.array(covs)
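
Note that matrices built this way are symmetric but not guaranteed to be positive semi-definite, so they are not always valid covariance matrices. That's fine for timing purposes, but for real inputs a safer sketch (assuming scikit-learn is available) would be:

from sklearn.datasets import make_spd_matrix

# Each matrix is guaranteed symmetric positive definite,
# i.e. a valid covariance matrix.
covs = np.array([make_spd_matrix(dim, random_state=i) for i in range(obs)])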

# Here is where the trouble starts.
time_start = time.time()
for i in range(iters):
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j],upper[j],means[j],covs[j])
        results.append(this_p)
time_end = time.time()

print(time_end-time_start)

Here I have a dataset with 1,500 observations which I'm evaluating 1,000 times. On my machine, this takes 6.74 seconds to run.

Note that I'm not trying to get rid of the outer for-loop (over i); I just created it to mimic my real problem. The for-loop I'm really trying to eliminate is the inner one (over j), as sketched below.
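
To make the target concrete, this is the interface I'd like to end up with. The placeholder body below still loops (the name mvnun_vectorized is hypothetical); the goal is to replace that loop with a truly vectorized computation:

def mvnun_vectorized(lower, upper, means, covs):
    # Placeholder: same inner loop as above, just wrapped in a function.
    return np.array([mvnun(lower[j], upper[j], means[j], covs[j])[0]
                     for j in range(len(lower))])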

The execution time could be vastly reduced if I found a way to efficiently evaluate the CDF over the whole dataset.

I know that the mvnun function was originally written in Fortran (original code here) and "ported" to Python using f2py, as can be seen here.

Can anyone help me out with this? I've started looking at Theano, but it seems the only option I have there is to use the scan function, which might not be much of an improvement either.

Thank you!!!

Asked by Felipe D. on Sep 25 '17


1 Answer

This is only a partial answer, but there is a way to increase speed if the dimension of the multivariate normal distribution is small (2 or 3) and if the covariance matrix is the same for all observations.

import numpy as np
import openturns as ot

def computeRectangularDomainProbability(lower, upper, means, cov_matrix):
    """
    Compute the probability of a rectangular solid
    under a multinormal distribution.
    
    """
    # Center the bounds of the rectangular solid on the mean
    lower -= means
    upper -= means
    
    # The same covariance matrix for all rectangular solids.
    cov_matrix = ot.CovarianceMatrix(cov_matrix)
    
    # This way, we only need to define one multivariate normal distribution.
    # That is the trick that allows vectorization.
    dimension = lower.shape[1]   
    multinormal = ot.Normal([0.0] * dimension, cov_matrix)
    
    # The probability of the rectangular solid is a weighted sum
    # of the CDF of the vertices (with weights equal to 1 or -1).
    # The following block computes the CDFs and applies the correct weight.
    full_reverse_binary = np.array(list(bin(2**dimension)[:1:-1]), dtype=int)
    prob = 0.0
    for i in range(2**dimension):
        reverse_binary = np.array(list(bin(i)[:1:-1]), dtype=int)
        reverse_binary = np.append(reverse_binary, 
                                   np.zeros(len(full_reverse_binary) - 
                                            len(reverse_binary) -
                                            1)).astype(int)
        point = np.zeros(lower.shape)
        for num, digit in enumerate(reverse_binary):
            if digit:
                point[:, num] = upper[:, num]
            else:
                point[:, num] = lower[:, num]
        cdf = np.array(multinormal.computeCDF(point))
        if (reverse_binary.sum() % 2) == (dimension % 2):
            prob += cdf
        else:
            prob -= cdf
    
    return prob.reshape(-1,)
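
The weighted sum over the 2^dimension vertices is the standard inclusion-exclusion identity for rectangle probabilities. In dimension 2, with F the CDF of the centered distribution, it reads:

P(a1 < X1 <= b1, a2 < X2 <= b2) = F(b1, b2) - F(a1, b2) - F(b1, a2) + F(a1, a2)

The vectorization comes from the fact that computeCDF accepts a whole sample of points at once, so each of the 2^dimension calls above evaluates the CDF for all rectangular solids simultaneously.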

Test script: dimension 2

import time
import numpy as np
from scipy.stats.mvn import mvnun # library that calculates MVN CDF
from sklearn.datasets import make_spd_matrix
import openturns as ot

iters = 1000 # loop size
obs = 1500 # number of rectangular solids
dim = 2 # dimension of multivariate normal distribution

time_mvnun = 0.0
time_openturns = 0.0
discrepancy = 0.0
np.random.seed(0)

for iteration in range(iters):

    lower = np.random.rand(obs,dim)
    upper = lower + np.random.rand(obs,dim)
    means = np.random.rand(obs,dim)
    
    # Generate one random covariance matrix with sklearn, shared by all
    # rectangular solids, to make sure it is symmetric positive definite
    cov_mtx = make_spd_matrix(dim)
        
    time_start = time.time()
    results = []
    for j in range(obs):
        this_p, this_i = mvnun(lower[j],upper[j],means[j],cov_mtx)
        results.append(this_p)
    results = np.array(results)
    time_end = time.time()
    time_mvnun += time_end - time_start
    
    
    time_start = time.time()       
    otparallel = computeRectangularDomainProbability(lower, upper, means, cov_mtx)
    time_end = time.time()
    time_openturns += time_end - time_start
    
    mvnorm_vs_otparallel = np.abs(results - otparallel).sum()
    discrepancy += mvnorm_vs_otparallel

print('Dimension {}'.format(dim))

# Print computation time
print('mvnun     time: {0:e}'.format(time_mvnun))
print('openturns time: {0:e}'.format(time_openturns))
print('ratio mvnun/ot: {0:f}'.format(time_mvnun / time_openturns))

# Check that the results are the same for mvnun and openturns
print('mvnun-openturns result discrepancy: {0:e}'.format(discrepancy))

Output on my machine:

Dimension 2
mvnun     time: 4.040635e+00
openturns time: 3.588211e+00
ratio mvnun/ot: 1.126086
mvnun-openturns result discrepancy: 8.057912e-11

There is a slight speedup: a little more than 10%.

Dimension 3

Let us change the global variables controlling the script.

iters = 100 # loop size
obs = 1500 # number of rectangular solids
dim = 3 # dimension of multivariate normal distribution

Output on my machine:

Dimension 3
mvnun     time: 2.378337e+01
openturns time: 1.596872e+00
ratio mvnun/ot: 14.893725
mvnun-openturns result discrepancy: 4.537064e-03

The gain is much more significant in dimension 3: the proposed code is 15 times faster.

Dimension 4

Unfortunately, OpenTURNS slows down a lot in dimension 4. It contains a smart implementation of the CDF for dimensions 1, 2 and 3, but falls back on a slower, more generic implementation for dimensions greater than 3.

iters = 1 # loop size
obs = 15 # number of rectangular solids
dim = 4 # dimension of multivariate normal distribution

Output on my machine:

Dimension 4
mvnun     time: 7.289171e-03
openturns time: 3.689714e+01
ratio mvnun/ot: 0.000198
mvnun-openturns result discrepancy: 6.297527e-07

In dimension 4, the proposed code is slower by about 4 orders of magnitude! This is probably because it needs to compute 2^4 = 16 CDFs per rectangular solid, and each of those CDF evaluations falls back on the slower generic implementation.

Answered by josephmure on Nov 13 '22