I’m new to Python and I’m trying to compute the normalized mutual information between two different signals. No matter what signals I use, the result I obtain is always 1, which I believe must be wrong, because the signals are different and not perfectly correlated.
I’m using the normalized mutual information function provided by scikit-learn: sklearn.metrics.normalized_mutual_info_score(labels_true, labels_pred).
Here’s the code I’m using:
from numpy.random import randn
from numpy import *
from matplotlib.pyplot import *
from sklearn.metrics.cluster import normalized_mutual_info_score as mi

def fzX(X):
    '''z-score the columns of X'''
    if len(X.shape) > 1:
        # X is a matrix: one variable per column
        meanX = mean(X, 0)
        stdX = std(X, 0)
        stdX[stdX < 1e-9] = 0
        zX = zeros(X.shape)
        for i in range(X.shape[1]):
            if stdX[i] > 0:
                zX[:, i] = (X[:, i] - meanX[i]) / stdX[i]
            else:
                zX[:, i] = 0
    else:
        # X is a vector: a single variable
        meanX = mean(X)
        stdX = std(X)
        zX = (X - meanX) / stdX
    return zX, meanX, stdX
def fMI(X):
    '''Variables in columns; returns the mutual information matrix of the z-scored data.'''
    zX, meanX, stdX = fzX(X)
    n = X.shape[1]
    Mut_Info = zeros((n, n))
    for i in range(n):
        for j in range(i, n):
            Mut_Info[i, j] = mi(zX[:, i], zX[:, j])
            Mut_Info[j, i] = Mut_Info[i, j]
    plot(zX); show()
    return Mut_Info
t = arange(0, 100, 0.1)                   # t = 0:0.1:99.9
N = len(t)                                # number of samples in t
u = sin(2 * pi * t) + (randn(N) * 2) ** 2
y = (cos(2 * pi * t - 2)) ** 2 + randn(N) * 2
X = zeros((len(u), 2))
X[:, 0] = u
X[:, 1] = y
mut = fMI(X)
print(mut)
plot(X)
show()
Has any of you had a similar problem before? Do you know what I’m doing wrong?
Thank you very much in advance for your time.
Normalized Mutual Information (NMI) is a normalization of the Mutual Information (MI) score to scale the results between 0 (no mutual information) and 1 (perfect correlation).
Normalized mutual information (NMI) gives us the reduction in entropy of class labels when we are given the cluster labels. In a sense, NMI tells us how much the uncertainty about class labels decreases when we know the cluster labels. It is similar to the information gain in decision trees.
While mutual information (MI) cannot be negative, the adjusted mutual information (AMI) can be negative. This is also mentioned in the sklearn documentation: the AMI returns a value of 1 when the two partitions are identical (i.e. perfectly matched).
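A quick, hedged illustration of that difference (the random labelings and the seed below are just for demonstration): for two independent random labelings, NMI tends to sit visibly above zero purely by chance, while AMI corrects for chance and lands near zero.

import numpy as np
from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

rng = np.random.default_rng(0)
a = rng.integers(0, 10, size=200)   # 200 points assigned to 10 random "clusters"
b = rng.integers(0, 10, size=200)   # an independent random labeling

print(normalized_mutual_info_score(a, b))  # noticeably above 0, purely by chance
print(adjusted_mutual_info_score(a, b))    # close to 0 (and can go below 0)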
The Mutual Information is a measure of the similarity between two labels of the same data.
Your floating point data can't be used this way -- normalized_mutual_info_score is defined over clusters. The function is going to interpret every distinct floating point value as a distinct cluster. And if you look back at the documentation, you'll see that the function throws out information about the identity of the cluster labels. After all, the labels themselves are arbitrary, so anti-correlated labels have as much mutual information as correlated labels.
Examples
Here are a couple of examples based directly on the documentation:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 1, 0, 0])
1.0
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 0, 1, 1])
1.0
See how the labels are perfectly correlated in the first case, and perfectly anti-correlated in the second? But in both cases, the mutual information is 1.0. The same pattern continues for partially correlated values:
>>> normalized_mutual_info_score([1, 1, 0, 0], [1, 0, 1, 1])
0.34559202994421129
>>> normalized_mutual_info_score([1, 1, 0, 0], [0, 1, 0, 0])
0.34559202994421129
Swapping the labels just in the second sequence has no effect. And again, this time with floating point values:
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.1, 0.1, 0.1, 0.5])
0.34559202994421129
>>> normalized_mutual_info_score([0.1, 0.1, 0.5, 0.5], [0.5, 0.5, 0.5, 0.1])
0.34559202994421129
So having seen all that, this shouldn't seem so surprising:
>>> normalized_mutual_info_score([0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8])
1.0
Each floating point value is considered its own label, but the labels themselves are arbitrary. So the function can't tell any difference between the two sequences of labels, and returns 1.0.
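As a sanity check, the 0.3455... value above can be reproduced by hand from the definition. Note that the numbers in this answer appear to use the geometric-mean normalization, NMI = MI / sqrt(H(a) * H(b)), which older scikit-learn releases used by default; with the current default (average_method='arithmetic') the result comes out slightly different.

import numpy as np
from scipy.stats import entropy
from sklearn.metrics import mutual_info_score, normalized_mutual_info_score

a = [1, 1, 0, 0]
b = [1, 0, 1, 1]

def H(labels):
    '''Entropy (in nats) of the empirical label distribution.'''
    _, counts = np.unique(labels, return_counts=True)
    return entropy(counts / counts.sum())

print(mutual_info_score(a, b) / np.sqrt(H(a) * H(b)))                  # ~0.3456
print(normalized_mutual_info_score(a, b, average_method='geometric'))  # same value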
Working with floating point data
If you're starting out with floating point data, and you need to do this calculation, you probably want to assign cluster labels, perhaps by putting points into bins using two different schemes.
For example, in the first scheme, you could put every value p <= 0.5 in cluster 0 and every value p > 0.5 in cluster 1. Then, in the second scheme, you could put every value p <= 0.4 in cluster 0 and every value p > 0.4 in cluster 1. These clusterings would mostly overlap; the points where they did not would cause the mutual information score to go down.
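Here's a minimal sketch of those two schemes, with uniform random data standing in for your real values (the thresholds 0.5 and 0.4 are the hypothetical cut points from the paragraph above):

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

p = np.random.rand(1000)            # stand-in floating point data in [0, 1]
scheme_1 = (p > 0.5).astype(int)    # cluster 0 if p <= 0.5, cluster 1 otherwise
scheme_2 = (p > 0.4).astype(int)    # cluster 0 if p <= 0.4, cluster 1 otherwise

# Roughly 10% of the points fall in (0.4, 0.5], where the two schemes
# disagree, so the score drops below 1.0.
print(normalized_mutual_info_score(scheme_1, scheme_2))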
There are other possible clustering schemes -- I'm not quite sure what your goal is, so I can't give more concrete advice than that.
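That said, purely as a sketch and not tailored to any particular goal, one generic option for the continuous signals in the question is to discretize each signal into a handful of equal-width bins (n_bins=10 below is an arbitrary choice, not anything prescribed by scikit-learn) and compare the resulting labels:

import numpy as np
from sklearn.metrics.cluster import normalized_mutual_info_score

t = np.arange(0, 100, 0.1)
u = np.sin(2 * np.pi * t) + (np.random.randn(len(t)) * 2) ** 2
y = np.cos(2 * np.pi * t - 2) ** 2 + np.random.randn(len(t)) * 2

def to_bins(x, n_bins=10):
    '''Map each value to an integer bin label in 0 .. n_bins-1.'''
    edges = np.linspace(x.min(), x.max(), n_bins + 1)
    return np.digitize(x, edges[1:-1])  # inner edges only

print(normalized_mutual_info_score(to_bins(u), to_bins(y)))  # well below 1.0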