I’m new to Python and I’m trying to compute the normalized mutual information between two different signals. No matter what signals I use, the result I obtain is always 1, which I believe must be wrong, because the signals are different and not perfectly correlated.
I’m using the normalized mutual information function provided by scikit-learn: sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred).
Here’s the code I’m using:
from numpy.random import randn
from numpy import *
from matplotlib.pyplot import *
from sklearn.metrics.cluster import normalized_mutual_info_score as mi

def fzX(X):
    '''z-score the columns of X'''
    if len(X.shape) > 1:
        # X is a matrix: one variable per column
        meanX = mean(X, 0)
        stdX = std(X, 0)
        stdX[stdX < 1e-9] = 0
        zX = zeros(X.shape)
        for i in range(X.shape[1]):
            if stdX[i] > 0:
                zX[:, i] = (X[:, i] - meanX[i]) / stdX[i]
            else:
                zX[:, i] = 0
    else:
        # X is a vector: a single variable
        meanX = mean(X)
        stdX = std(X)
        zX = (X - meanX) / stdX
    return zX, meanX, stdX
def fMI(X):
    '''Variables in columns; returns the mutual information matrix of the z-scored data.'''
    zX, meanX, stdX = fzX(X)
    n = X.shape[1]
    Mut_Info = zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            Mut_Info[i, j] = mi(zX[:, i], zX[:, j])
            Mut_Info[j, i] = Mut_Info[i, j]
    plot(zX); show()
    return Mut_Info
t = arange(0, 100, 0.1)                   # t = 0:0.1:99.9
N = len(t)                                # number of samples in t
u = sin(2 * pi * t) + (randn(N) * 2) ** 2
y = (cos(2 * pi * t - 2)) ** 2 + randn(N) * 2
X = zeros((len(u), 2))
X[:, 0] = u
X[:, 1] = y
mut = fMI(X)
print(mut)
plot(X)
show()
Has any of you had a similar problem before? Do you know what I’m doing wrong?
Thank you very much in advance for your time.
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).
Normalized mutual information (NMI) gives us the reduction in entropy of class labels when we are given the cluster labels. In a sense, NMI tells us how much the uncertainty about class labels decreases when we know the cluster labels. It is similar to the information gain in decision trees.
While mutual information (MI) cannot be negative, the adjusted mutual information (AMI) can be negative. This is also mentioned in the sklearn documentation: the AMI returns a value of 1 when the two partitions are identical (i.e. perfectly matched).
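A quick, hedged illustration of that difference (the random labelings and the seed below are just for demonstration): for two independent random labelings, NMI tends to sit visibly above zero purely by chance, while AMI corrects for chance and lands near zero.

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=200)   # 200 points assigned to 10 random "clusters"
b = rng.integers(0, 10, size=200)   # an independent random labeling

print(normalized_mutual_info_score(a, b))  # noticeably above 0, purely by chance
print(adjusted_mutual_info_score(a, b))    # close to 0 (and can go below 0)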
The Mutual Information is a measure of the similarity between two labels of the same data.
Your floating point data can't be used this way -- normalized_mutual_info_score is defined over clusters. The function is going to interpret every distinct floating point value as a distinct cluster. And if you look back at the documentation, you'll see that the function throws out information about the identity of the cluster labels. After all, the labels themselves are arbitrary, so anti-correlated labels have as much mutual information as correlated labels.
Examples
Here are a couple of examples based directly on the documentation:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 1, 0, 0])
1.0
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 0, 1, 1])
1.0
See how the labels are perfectly correlated in the first case, and perfectly anti-correlated in the second? But in both cases, the mutual information is 1.0. The same pattern continues for partially correlated values:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 0, 1, 1])
0.34559202994421129
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 1, 0, 0])
0.34559202994421129
Swapping the labels just in the second sequence has no effect. And again, this time with floating point values:
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.1, 0.5])
0.34559202994421129
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.5, 0.5, 0.5, 0.1])
0.34559202994421129
So having seen all that, this shouldn't seem so surprising:
>>> normalized_mutual_info_score([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
1.0
Each floating point value is considered its own label, but the labels themselves are arbitrary. So the function can't tell any difference between the two sequences of labels, and returns 1.0.
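As a sanity check, the 0.3455... value above can be reproduced by hand from the definition. Note that the numbers in this answer appear to use the geometric-mean normalization, NMI = MI / sqrt(H(a) * H(b)), which older scikit-learn releases used by default; with the current default (average_method='arithmetic') the result comes out slightly different.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

a = [1, 1, 0, 0]
b = [1, 0, 1, 1]

def H(labels):
    '''Entropy (in nats) of the empirical label distribution.'''
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts / counts.sum())

print(mutual_info_score(a, b) / np.sqrt(H(a) * H(b)))                  # ~0.3456
print(normalized_mutual_info_score(a, b, average_method='geometric'))  # same value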
Working with floating point data
If you're starting out with floating point data, and you need to do this calculation, you probably want to assign cluster labels, perhaps by putting points into bins using two different schemes.
For example, in the first scheme, you could put every value p <= 0.5 in cluster 0 and every value p > 0.5 in cluster 1. Then, in the second scheme, you could put every value p <= 0.4 in cluster 0 and every value p > 0.4 in cluster 1. These clusterings would mostly overlap; the points where they did not would cause the mutual information score to go down.
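Here's a minimal sketch of those two schemes, with uniform random data standing in for your real values (the thresholds 0.5 and 0.4 are the hypothetical cut points from the paragraph above):

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

p = np.random.rand(1000)            # stand-in floating point data in [0, 1]
scheme_1 = (p > 0.5).astype(int)    # cluster 0 if p <= 0.5, cluster 1 otherwise
scheme_2 = (p > 0.4).astype(int)    # cluster 0 if p <= 0.4, cluster 1 otherwise

# Roughly 10% of the points fall in (0.4, 0.5], where the two schemes
# disagree, so the score drops below 1.0.
print(normalized_mutual_info_score(scheme_1, scheme_2))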
There are other possible clustering schemes -- I'm not quite sure what your goal is, so I can't give more concrete advice than that.
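That said, purely as a sketch and not tailored to any particular goal, one generic option for the continuous signals in the question is to discretize each signal into a handful of equal-width bins (n_bins=10 below is an arbitrary choice, not anything prescribed by scikit-learn) and compare the resulting labels:

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

t = np.arange(0, 100, 0.1)
u = np.sin(2 * np.pi * t) + (np.random.randn(len(t)) * 2) ** 2
y = np.cos(2 * np.pi * t - 2) ** 2 + np.random.randn(len(t)) * 2

def to_bins(x, n_bins=10):
    '''Map each value to an integer bin label in 0 .. n_bins-1.'''
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])  # inner edges only

print(normalized_mutual_info_score(to_bins(u), to_bins(y)))  # well below 1.0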