Python's implementation of Mutual Information

Tags:

python

machine-learning

feature-selection

I am having trouble using the mutual information function that Python's machine learning libraries provide, in particular sklearn.metrics.mutual_info_score(labels_true, labels_pred, contingency=None):

(http://scikit-learn.org/stable/modules/generated/sklearn.metrics.mutual_info_score.html)

I am trying to reproduce the example from the Stanford NLP tutorial site:

[Image: example (https://i.imgur.com/VSSd5KS.jpg)]

The page is here: http://nlp.stanford.edu/IR-book/html/htmledition/mutual-information-1.html#mifeatsel2

The problem is that I keep getting different results and have not yet figured out why.

I understand the concept of mutual information and feature selection; I just don't understand how it is implemented in Python. I pass mutual_info_score two arrays based on the example from the NLP site, but it outputs different results. Another odd thing is that however I change the numbers in those arrays, I am most likely to get the same result. Am I supposed to use another data structure specific to Python, or what is the issue behind this? If anyone has used this function successfully, it would be a great help to me. Thank you for your time.

asked Jul 10 '14 21:07

and_apo

1 Answer

I encountered the same issue today. After a few trials I found the real reason: if you strictly follow the NLP tutorial you take log2, but sklearn.metrics.mutual_info_score uses the natural logarithm (base e, Euler's number). I didn't find this detail in the sklearn documentation...

I verified this by:

import numpy as np

def computeMI(x, y):
    # Accept lists as well as arrays; the boolean masks below need ndarrays.
    x = np.asarray(x)
    y = np.asarray(y)
    sum_mi = 0.0
    x_value_list = np.unique(x)
    y_value_list = np.unique(y)
    Px = np.array([len(x[x == xval]) / float(len(x)) for xval in x_value_list])  # P(x)
    Py = np.array([len(y[y == yval]) / float(len(y)) for yval in y_value_list])  # P(y)
    for i in range(len(x_value_list)):  # range instead of xrange, so this runs on Python 3
        if Px[i] == 0.:
            continue
        sy = y[x == x_value_list[i]]  # y values observed together with this x value
        if len(sy) == 0:
            continue
        pxy = np.array([len(sy[sy == yval]) / float(len(y)) for yval in y_value_list])  # P(x,y)
        t = pxy[Py > 0.] / Py[Py > 0.] / Px[i]  # P(x,y) / (P(x) * P(y))
        sum_mi += sum(pxy[t > 0] * np.log2(t[t > 0]))  # sum of P(x,y) * log2(P(x,y) / (P(x) * P(y)))
    return sum_mi

If you change np.log2 to np.log, I think it would give you the same answer as sklearn. The only difference is that when this method returns 0, sklearn returns a number very close to 0. (And of course, use sklearn if you don't care about the log base; my code is just a demo and performs poorly...)
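To make the log-base point concrete, here is a minimal standard-library sketch (not sklearn's implementation). The helper name mutual_information and the 2x2 table are hypothetical, chosen only so the result is easy to check by hand: two variables that always agree on a 50/50 split share exactly 1 bit, which is ln 2 (about 0.693) nats.

```python
import math

def mutual_information(counts, log=math.log):
    """MI of a contingency table given as a list of rows of joint counts.

    `log` selects the unit: math.log gives nats, math.log2 gives bits.
    """
    n = float(sum(sum(row) for row in counts))
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, n_ij in enumerate(row):
            if n_ij == 0:
                continue  # an empty cell contributes nothing (0 * log 0 := 0)
            # P(x,y) * log(P(x,y) / (P(x) * P(y))), written in terms of counts
            mi += (n_ij / n) * log(n * n_ij / (row_totals[i] * col_totals[j]))
    return mi

# Hypothetical table: the two variables always agree, split 50/50.
table = [[5, 0], [0, 5]]
mi_nats = mutual_information(table)             # natural log
mi_bits = mutual_information(table, math.log2)  # base 2
print(mi_nats, mi_bits, mi_nats / math.log(2))
```

Dividing a value in nats by math.log(2) gives the value in bits, so a natural-log result can be compared with the tutorial's log2 figures without reimplementing anything.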

FYI: 1) sklearn.metrics.mutual_info_score takes lists as well as np.array; 2) sklearn.metrics.cluster.entropy also uses log, not log2.

Edit: as for "same result", I'm not sure what you really mean. In general, the values in the vectors don't really matter; it is the "distribution" of values that matters. You care about P(X=x), P(Y=y) and P(X=x,Y=y), not the values x and y themselves.
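A small sketch of that last point, using only the standard library. The helper mi_from_labels and the sample vectors are hypothetical. Renaming the labels leaves every count, and therefore every probability, unchanged, so the mutual information comes out the same.

```python
import math
from collections import Counter

def mi_from_labels(x, y):
    """MI in nats between two equal-length label sequences."""
    n = float(len(x))
    joint = Counter(zip(x, y))  # counts of each (x, y) pair
    px = Counter(x)             # counts of each x value
    py = Counter(y)             # counts of each y value
    return sum(
        (c / n) * math.log(n * c / (px[a] * py[b]))
        for (a, b), c in joint.items()
    )

x = [0, 0, 1, 1, 1, 0]
y = [0, 1, 1, 1, 0, 0]

# Same pattern of agreement, but with arbitrary replacement labels.
x_renamed = [7 if v == 0 else 42 for v in x]
y_renamed = ["a" if v == 0 else "b" for v in y]

original = mi_from_labels(x, y)
renamed = mi_from_labels(x_renamed, y_renamed)
print(original, renamed, math.isclose(original, renamed))
```

So changing the numbers in the arrays only changes the score if it also changes how often each value, and each pair of values, occurs.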

answered Sep 23 '22 01:09

lyx
