For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?
By mutual information, I mean:
I(X, Y) = H(X) + H(Y) - H(X,Y)
where H(X) refers to the Shannon entropy of X.
Currently I'm using np.histogram2d and np.histogram to calculate the joint (X, Y) and individual (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am doing a nested for loop:

n = A.shape[1]
for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])
Surely there must be better/faster ways to do this?
As an aside, I've also looked for ways to map functions over the columns (or rows) of an array, i.e. column-wise or row-wise operations, but haven't found a good general answer yet.
Here is my full implementation, following the conventions of the Wikipedia page on mutual information:
import numpy as np

def calc_MI(X, Y, bins):
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -sum(c_normalized * np.log2(c_normalized))
    return H

A = np.array([[2.0, 140.0, 128.23, -150.5,  -5.4 ],
              [2.4, 153.11, 130.34, -130.1,  -9.5 ],
              [1.2, 156.9,  120.11, -110.45, -1.12]])

bins = 5  # ?
n = A.shape[1]
matMI = np.zeros((n, n))

for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)
Although my working version with nested for loops runs at reasonable speed, I'd like to know whether there is a more efficient way to apply calc_MI to all the columns of A (to calculate their pairwise mutual information).
I'd also like to know:

1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator)? See the sketch after this list for the kind of thing I mean.
2. Whether there are other, more efficient implementations for this specific calculation (mutual information)?
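For concreteness, here is a minimal sketch of the kind of column-wise / pairwise mapping I mean, assuming np.apply_along_axis for single-column functions and itertools.combinations to iterate over column pairs (this reuses A, bins, and calc_MI as defined above; it is only an illustration, not a claimed speedup, since the Python-level loop over pairs remains):

from itertools import combinations
import numpy as np

# Column-wise mapping of a 1-D function (here: Shannon entropy of a binned column).
def column_entropy(col, nbins=bins):
    counts = np.histogram(col, nbins)[0]
    p = counts / float(np.sum(counts))
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

col_entropies = np.apply_along_axis(column_entropy, 0, A)  # one value per column

# Pairwise mapping over the upper triangle of column pairs.
n = A.shape[1]
matMI = np.zeros((n, n))
for ix, jx in combinations(range(n), 2):
    matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)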
To calculate mutual information, you need to know the distribution of the pair (X, Y), i.e. the counts for each possible value of the pair. This is described by a 2-dimensional matrix (a contingency table), as in https://stackoverflow.com/questions/20491028/optimal-way-to-compute-pairwise-mutual-information-using-numpy.
"The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable." — Page 57, Pattern Recognition and Machine Learning, 2006.

This can be stated formally as follows:

I(X; Y) = KL(p(X, Y) || p(X) * p(Y))
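As a minimal sketch of this equivalence (my own illustration, using the same np.histogram2d counts as above), the KL form can be computed directly from the normalized contingency table and should agree with the entropy-based calc_MI, since both use base-2 logarithms here:

import numpy as np

def calc_MI_kl(x, y, bins):
    # Joint counts -> joint probability p(x, y).
    c_xy = np.histogram2d(x, y, bins)[0]
    p_xy = c_xy / c_xy.sum()

    # Marginals p(x) and p(y), and their outer product p(x) * p(y).
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_x_p_y = np.outer(p_x, p_y)

    # KL(p(X, Y) || p(X) p(Y)), skipping zero cells, in bits (log base 2).
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / p_x_p_y[nz]))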
The MI score falls in the range from 0 to ∞. The higher the value, the stronger the connection between the feature and the target, which suggests that the feature should be kept in the training dataset. If the MI score is 0 or very low (e.g. 0.01), the low score suggests a weak connection between the feature and the target.
Mutual information (MI) is a non-negative value that measures the mutual dependence between two random variables. It measures the amount of information we can learn about one variable by observing the values of the other.
I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.
In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G2 statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi
The only difference between this and your implementation is that it uses the natural logarithm instead of the base-2 logarithm (so it expresses the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
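For example, a quick sanity check of my own (the random x, y, and bin count below are assumptions, not from the original post; note that chi2_contingency requires a contingency table with no all-zero rows or columns, so a reasonably large sample is used):

import numpy as np

# Hypothetical check of the nats -> bits conversion on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x + rng.normal(size=10000)

mi_nats = calc_MI(x, y, 10)        # scipy-based version above, in nats
mi_bits = mi_nats / np.log(2)      # convert nats -> bits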
If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
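Either replacement has the same signature as the original calc_MI, so it drops straight into the pairwise loop from the question. A minimal sketch (assuming A, bins, and the n x n result matrix as defined earlier; both replacements return the MI in nats):

import numpy as np

n = A.shape[1]
matMI = np.zeros((n, n))

# Fill the upper triangle only; mutual information is symmetric, so the
# lower triangle can simply be mirrored if a full matrix is needed.
for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)

matMI = matMI + matMI.T  # symmetric result (diagonal left at 0, as in the original)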