For an m x n matrix, what's the optimal (fastest) way to compute the mutual information for all pairs of columns (n x n)?
By mutual information, I mean:
I(X, Y) = H(X) + H(Y) - H(X,Y)
where H(X) refers to the Shannon entropy of X.
Currently I'm using np.histogram2d and np.histogram to calculate the joint (X, Y) and individual (X or Y) counts. For a given matrix A (e.g. a 250000 x 1000 matrix of floats), I am doing a nested for loop:

n = A.shape[1]
for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx])
Surely there must be better/faster ways to do this?
As an aside, I've also looked for ways to map functions over the columns (or rows) of an array, i.e. column-wise or row-wise operations, but haven't found a good general answer yet.
Here is my full implementation, following the conventions of the Wikipedia page on mutual information:
import numpy as np

def calc_MI(X, Y, bins):
    c_XY = np.histogram2d(X, Y, bins)[0]
    c_X = np.histogram(X, bins)[0]
    c_Y = np.histogram(Y, bins)[0]

    H_X = shan_entropy(c_X)
    H_Y = shan_entropy(c_Y)
    H_XY = shan_entropy(c_XY)

    MI = H_X + H_Y - H_XY
    return MI

def shan_entropy(c):
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    H = -sum(c_normalized * np.log2(c_normalized))
    return H

A = np.array([[2.0, 140.0, 128.23, -150.5,  -5.4 ],
              [2.4, 153.11, 130.34, -130.1,  -9.5 ],
              [1.2, 156.9,  120.11, -110.45, -1.12]])

bins = 5  # ?
n = A.shape[1]
matMI = np.zeros((n, n))

for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)
Although my working version with nested for loops runs at reasonable speed, I'd like to know whether there is a more efficient way to apply calc_MI to all the columns of A (to calculate their pairwise mutual information).
I'd also like to know:

1. Whether there are efficient ways to map functions to operate on columns (or rows) of np.arrays (maybe like np.vectorize, which looks more like a decorator)? See the sketch after this list for the kind of thing I mean.
2. Whether there are other, more efficient implementations for this specific calculation (mutual information)?
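For concreteness, here is a minimal sketch of the kind of column-wise / pairwise mapping I mean, assuming np.apply_along_axis for single-column functions and itertools.combinations to iterate over column pairs (this reuses A, bins, and calc_MI as defined above; it is only an illustration, not a claimed speedup, since the Python-level loop over pairs remains):

from itertools import combinations
import numpy as np

# Column-wise mapping of a 1-D function (here: Shannon entropy of a binned column).
def column_entropy(col, nbins=bins):
    counts = np.histogram(col, nbins)[0]
    p = counts / float(np.sum(counts))
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

col_entropies = np.apply_along_axis(column_entropy, 0, A)  # one value per column

# Pairwise mapping over the upper triangle of column pairs.
n = A.shape[1]
matMI = np.zeros((n, n))
for ix, jx in combinations(range(n), 2):
    matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)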
To calculate mutual information, you need to know the distribution of the pair (X, Y), i.e. the counts for each possible value of the pair. This is described by a 2-dimensional matrix (a contingency table), as in https://stackoverflow.com/questions/20491028/optimal-way-to-compute-pairwise-mutual-information-using-numpy.
"The mutual information can also be calculated as the KL divergence between the joint probability distribution and the product of the marginal probabilities for each variable." — Page 57, Pattern Recognition and Machine Learning, 2006.

This can be stated formally as follows:

I(X; Y) = KL(p(X, Y) || p(X) * p(Y))
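As a minimal sketch of this equivalence (my own illustration, using the same np.histogram2d counts as above), the KL form can be computed directly from the normalized contingency table and should agree with the entropy-based calc_MI, since both use base-2 logarithms here:

import numpy as np

def calc_MI_kl(x, y, bins):
    # Joint counts -> joint probability p(x, y).
    c_xy = np.histogram2d(x, y, bins)[0]
    p_xy = c_xy / c_xy.sum()

    # Marginals p(x) and p(y), and their outer product p(x) * p(y).
    p_x = p_xy.sum(axis=1)
    p_y = p_xy.sum(axis=0)
    p_x_p_y = np.outer(p_x, p_y)

    # KL(p(X, Y) || p(X) p(Y)), skipping zero cells, in bits (log base 2).
    nz = p_xy > 0
    return np.sum(p_xy[nz] * np.log2(p_xy[nz] / p_x_p_y[nz]))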
The MI score falls in the range from 0 to ∞. The higher the value, the stronger the connection between the feature and the target, which suggests that the feature should be kept in the training dataset. If the MI score is 0 or very low (e.g. 0.01), the low score suggests a weak connection between the feature and the target.
Mutual information (MI) is a non-negative value that measures the mutual dependence between two random variables. It measures the amount of information we can learn about one variable by observing the values of the other.
I can't suggest a faster calculation for the outer loop over the n*(n-1)/2 vectors, but your implementation of calc_MI(x, y, bins) can be simplified if you can use scipy version 0.13 or scikit-learn.
In scipy 0.13, the lambda_ argument was added to scipy.stats.chi2_contingency. This argument controls the statistic that is computed by the function. If you use lambda_="log-likelihood" (or lambda_=0), the log-likelihood ratio is returned. This is also often called the G or G2 statistic. Other than a factor of 2*n (where n is the total number of samples in the contingency table), this is the mutual information. So you could implement calc_MI as:
from scipy.stats import chi2_contingency

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    g, p, dof, expected = chi2_contingency(c_xy, lambda_="log-likelihood")
    mi = 0.5 * g / c_xy.sum()
    return mi
The only difference between this and your implementation is that it uses the natural logarithm instead of the base-2 logarithm (so it expresses the information in "nats" instead of "bits"). If you really prefer bits, just divide mi by log(2).
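For example, a quick sanity check of my own (the random x, y, and bin count below are assumptions, not from the original post; note that chi2_contingency requires a contingency table with no all-zero rows or columns, so a reasonably large sample is used):

import numpy as np

# Hypothetical check of the nats -> bits conversion on random data.
rng = np.random.default_rng(0)
x = rng.normal(size=10000)
y = x + rng.normal(size=10000)

mi_nats = calc_MI(x, y, 10)        # scipy-based version above, in nats
mi_bits = mi_nats / np.log(2)      # convert nats -> bits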
If you have (or can install) sklearn (i.e. scikit-learn), you can use sklearn.metrics.mutual_info_score, and implement calc_MI as:
from sklearn.metrics import mutual_info_score

def calc_MI(x, y, bins):
    c_xy = np.histogram2d(x, y, bins)[0]
    mi = mutual_info_score(None, None, contingency=c_xy)
    return mi
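Either replacement has the same signature as the original calc_MI, so it drops straight into the pairwise loop from the question. A minimal sketch (assuming A, bins, and the n x n result matrix as defined earlier; both replacements return the MI in nats):

import numpy as np

n = A.shape[1]
matMI = np.zeros((n, n))

# Fill the upper triangle only; mutual information is symmetric, so the
# lower triangle can simply be mirrored if a full matrix is needed.
for ix in np.arange(n):
    for jx in np.arange(ix + 1, n):
        matMI[ix, jx] = calc_MI(A[:, ix], A[:, jx], bins)

matMI = matMI + matMI.T  # symmetric result (diagonal left at 0, as in the original)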