
Normalized Mutual Information by Scikit Learn giving me wrong value

Tags:

python

I’m new to Python and I’m trying to compute the normalized mutual information between two different signals. No matter what signals I use, the result is always 1, which I believe is impossible, because the signals are different and not perfectly correlated.

I’m using the normalized mutual information function provided by scikit-learn: sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred).

Here’s the code I’m using:

from numpy.random import randn
from numpy import *
from matplotlib.pyplot import *
from sklearn.metrics.cluster import normalized_mutual_info_score as mi
import pandas as pd

def fzX(X):
    '''z-scoring columns'''
    if len(X.shape)>1:
        '''X is a matrix ... several vars'''
        meanX=mean(X,0)
        stdX=std(X,0)
        stdX[stdX<1e-9]=0
        zX=zeros(X.shape)
        for i in range(X.shape[1]):
            if stdX[i]>0:
                zX[:,i]=(X[:,i]-meanX[i])/stdX[i]
            else:
                zX[:,i]=0
    else:
        '''X is a vector ... one var'''
        meanX=mean(X)
        stdX=std(X,0)
        zX=(X-meanX)/stdX
    return(zX,meanX,stdX)

def fMI(X):
    '''vars in columns,
       returns mutual info of normalized data'''
    zX,meanX,stdX=fzX(X)
    n=X.shape[1]
    Mut_Info=zeros((n,n))
    for i in range(n):
        for j in range(i,n):
            Mut_Info[i,j]=mi(zX[:,i],zX[:,j])
            Mut_Info[j,i]=Mut_Info[i,j]
    plot(zX);show()
    return(Mut_Info)

t=arange(0,100,0.1)  # t=0:0.1:99.9
N=len(t)  # number of samples in t
u=sin(2*pi*t)+(randn(N)*2)**2
y=(cos(2*pi*t-2))**2+randn(N)*2

X=zeros((len(u),2))
X[:,0]=u
X[:,1]=y

mut=fMI(X)
print(mut)

plot(X)
show()

Has anyone run into a similar problem before? Do you know what I’m doing wrong?

Thank you very much in advance for your dedicated time.

asked May 12 '15 by António Cova

People also ask

What is normalized mutual information score?

Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).
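For reference, here is a minimal sketch of that normalization, assuming the entropy-based definition scikit-learn implements; the entropy helper is my own, and the averaging method is configurable (newer releases default to the arithmetic mean of the two entropies, while the geometric mean shown here is what produces the 0.3455... values quoted in the answer below):

import numpy as np
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

def entropy(labels):
    '''Shannon entropy (in nats) of a sequence of labels.'''
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

a = [1, 1, 0, 0]
b = [1, 0, 1, 1]
mi = mutual_info_score(a, b)  # plain MI, in nats

# Geometric-mean normalization of MI by the two label entropies:
print(mi / np.sqrt(entropy(a) * entropy(b)))
print(normalized_mutual_info_score(a, b, average_method='geometric'))  # same value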

What is normalized mutual information in clustering?

Normalized mutual information (NMI) gives us the reduction in entropy of class labels when we are given the cluster labels. In a sense, NMI tells us how much the uncertainty about class labels decreases when we know the cluster labels. It is similar to the information gain in decision trees.
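That "reduction in entropy" reading can be checked directly: the mutual information equals H(class) minus H(class | cluster). A small sketch with toy labelings (the helper names here are mine):

import numpy as np
from sklearn.metrics import mutual_info_score

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log(p))

def conditional_entropy(classes, clusters):
    '''H(class | cluster): entropy of class labels within each cluster, weighted by cluster size.'''
    classes, clusters = np.asarray(classes), np.asarray(clusters)
    return sum((clusters == k).mean() * entropy(classes[clusters == k])
               for k in np.unique(clusters))

classes  = [0, 0, 1, 1]
clusters = [0, 1, 1, 1]
print(entropy(classes) - conditional_entropy(classes, clusters))  # reduction in uncertainty
print(mutual_info_score(classes, clusters))                       # same value, in nats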

Can adjusted mutual information be negative?

While mutual information (MI) cannot be negative, the adjusted mutual information (AMI) can be. This is also mentioned in the sklearn documentation: the AMI returns a value of 1 when the two partitions are identical (i.e. perfectly matched).
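A quick way to see a negative AMI (the label pairs below are hypothetical): two statistically independent labelings have zero MI, but their AMI dips below zero because the agreement expected by chance is subtracted off.

from sklearn.metrics import adjusted_mutual_info_score, mutual_info_score

a = [0, 0, 1, 1]
b = [0, 1, 0, 1]  # statistically unrelated to a
print(mutual_info_score(a, b))           # 0.0 -- MI is never negative
print(adjusted_mutual_info_score(a, b))  # negative: worse than chance agreement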

What is mutual information sklearn?

The Mutual Information is a measure of the similarity between two labels of the same data.


1 Answer

Your floating point data can't be used this way -- normalized_mutual_info_score is defined over clusters. The function is going to interpret every floating point value as a distinct cluster. And if you look back at the documentation, you'll see that the function throws out information about cluster labels. After all, the labels themselves are arbitrary, so anti-correlated labels have as much mutual information as correlated labels.

Examples

Here are a couple of examples based directly on the documentation:

>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 1, 0, 0])
1.0
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 0, 1, 1])
1.0

See how the labels are perfectly correlated in the first case, and perfectly anti-correlated in the second? But in both cases, the mutual information is 1.0. The same pattern continues for partially correlated values:

>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 0, 1, 1])
0.34559202994421129
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 1, 0, 0])
0.34559202994421129

Swapping the labels just in the second sequence has no effect. And again, this time with floating point values:

>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.1, 0.5])
0.34559202994421129
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.5, 0.5, 0.5, 0.1])
0.34559202994421129

So having seen all that, this shouldn't seem so surprising:

>>> normalized_mutual_info_score([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
1.0

Each distinct floating point value is treated as its own label, but the labels themselves are arbitrary. So the function can't tell any difference between the two sequences of labels, and returns 1.0.

Working with floating point data

If you're starting out with floating point data, and you need to do this calculation, you probably want to assign cluster labels, perhaps by putting points into bins using two different schemes.

For example, in the first scheme, you could put every value p <= 0.5 in cluster 0 and p > 0.5 in cluster 1. Then, in the second scheme, you could put every value p <= 0.4 in cluster 0 and p > 0.4 in cluster 1. These clusterings would mostly overlap; the points where they did not would cause the mutual information score to go down.
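Here's a minimal sketch of that two-threshold idea (the data and thresholds are made up for illustration):

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

rng = np.random.default_rng(0)
x = rng.random(1000)              # hypothetical floating point data in [0, 1)

labels_a = (x > 0.5).astype(int)  # scheme 1: threshold at 0.5
labels_b = (x > 0.4).astype(int)  # scheme 2: threshold at 0.4

# The two clusterings mostly agree, so NMI comes out high but below 1.0
print(normalized_mutual_info_score(labels_a, labels_b))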

There are other possible clustering schemes -- I'm not quite sure what your goal is, so I can't give more concrete advice than that.

answered Oct 05 '22 by senderle