Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Continuous mutual information in Python

[Frontmatter] (skip this if you just want the question):

I'm currently looking at using Shannon-Weaver Mutual Information and normalized redundancy to measure the degree of information masking between bags of discrete and continuous feature values, organized by feature. Using this method, it is my goal to construct an algorithm that looks very similar to ID3, but instead of using Shannon entropy, the algorithm will seek (as a loop constraint) to maximize or minimize shared information between a single feature and a collection of features based on the complete input feature space, adding new features to the latter collection if (and only if) they increase or decrease mutual information, respectively. This, in effect, moves ID3's decision algorithm into pairspace, stapling an ensemble approach to it with all of the expected time and space complexities of both methods.

[/Frontmatter]


On to the question: I'm trying to get a continuous integrator working in Python using SciPy. Because I'm working with comparisons of discrete and continuous variables, my current strategy for each comparison for feature-feature pairs is as follows:

  • Discrete feature versus discrete feature: use the discrete form of mutual information. This results in a double summation of the probabilities, which my code handles without issue.

  • All other cases (discrete versus continuous, the inverse, and continuous versus continuous): use the continuous form, using a Gaussian estimator to smooth out the probability density functions.

It is possible for me to perform some kind of discretization for the latter cases, but because my input data sets are not inherently linear, this is potentially needlessly complex.


Here's the salient code:

import math
import numpy
import scipy
from scipy.stats import gaussian_kde
from scipy.integrate import dblquad

# Constants
MIN_DOUBLE = 4.9406564584124654e-324 
                    # The minimum size of a Float64; used here to prevent the
                    #  logarithmic function from hitting its undefined region
                    #  at its asymptote of 0.
INF = float('inf')  # The floating-point representation for "infinity"

# x and y are previously defined as collections of 
# floating point values with the same length

# Kernel estimation
gkde_x = gaussian_kde(x)
gkde_y = gaussian_kde(y)

if len(binned_x) != len(binned_y) and len(binned_x) != len(x):
    x.append(x[0])
    y.append(y[0])

gkde_xy = gaussian_kde([x,y])
mutual_info = lambda a,b: gkde_xy([a,b]) * \
           math.log((gkde_xy([a,b]) / (gkde_x(a) * gkde_y(b))) + MIN_DOUBLE)

# Compute MI(X,Y)
(minfo_xy, err_xy) = \
    dblquad(mutual_info, -INF, INF, lambda a: 0, lambda a: INF)

print 'minfo_xy = ', minfo_xy

Note that overcounting exactly one point is done deliberately to prevent a singularity in SciPy's gaussian_kde class. As the size of x and y mutually approach infinity, this effect becomes negligible.


My current snag is in trying to get multiple integration working against a Gaussian kernel density estimate in SciPy. I've been trying to use SciPy's dblquad to perform the integration, but in the latter case, I receive an astounding spew of the following messages.

When I set numpy.seterr ( all='ignore' ):

Warning: The ocurrence of roundoff error is detected, which prevents the requested tolerance from being achieved. The error may be underestimated.

And when I set it to 'call' using an error handler:

Floating point error (underflow), with flag 4

Floating point error (invalid value), with flag 8

Pretty easy to figure out what's going on, right? Well, almost: IEEE 754-2008 and SciPy only tell me what's going on here, not why or how to work around it.


The upshot: minfo_xy generally resolves to nan; its sampling is insufficient to prevent information from becoming lost or invalid when performing Float64 math.

Is there a general workaround for this problem when using SciPy?

Even better: if there is a robust, canned implementation of continuous mutual information for Python with an interface that takes two collections of floating point values or a merged collection of pairs, it would resolve this complete problem. Please link it if you know of one that exists.

Thank you in advance.


Edit: this resolves the nan propagation issue in the example above:

mutual_info = lambda a,b: gkde_xy([a,b]) * \
    math.log((gkde_xy([a,b]) / ((gkde_x(a) * gkde_y(b)) + MIN_DOUBLE)) \
        + MIN_DOUBLE)

However, the question of roundoff correction remains, as does the request for a more robust implementation. Any help in either domain would be greatly appreciated.

like image 351
MrGomez Avatar asked Dec 02 '11 21:12

MrGomez


People also ask

What is mutual information Python?

The Mutual Information is a measure of the similarity between two labels of the same data.

What is continuous data in Python?

Continuous variables are those whose values may take any number within a range. Examples of continuous variables include the price of a product, income, house price, or interest rate. Categorical variables are values that are selected from a group of categories, also called labels.

What is mutual_ info_ regression?

Estimate mutual information for a continuous target variable.

How do you calculate mutual information examples?

I(X,W; Y,Z) = I(X;Y) + I(W;Z) \, . This follows easily from the definition of the mutual information. It's also a property that information should have. For example, suppose X and Y are the cost of a bottle of wine and how it tastes, and W and Z are the height and weight of tree squirrels.


1 Answers

Before trying more radical solutions like reframing the problem or using different integration tools, see if this helps. Replace INF=float('INF') with INF=1E12 or some other large number -- that may eliminate NaN results created by simple arithmetic operations on the input variables.

No promises on this one, but it is sometimes helpful to try a quick fix before engaging in a significant algorithmic rewrite or substitution of alternate tools.

like image 197
Raymond Hettinger Avatar answered Sep 20 '22 04:09

Raymond Hettinger