Continuous mutual information in Python

[Frontmatter] (skip this if you just want the question):

I'm currently looking at using Shannon-Weaver Mutual Information and normalized redundancy to measure the degree of information masking between bags of discrete and continuous feature values, organized by feature. Using this method, it is my goal to construct an algorithm that looks very similar to ID3, but instead of using Shannon entropy, the algorithm will seek (as a loop constraint) to maximize or minimize shared information between a single feature and a collection of features based on the complete input feature space, adding new features to the latter collection if (and only if) they increase or decrease mutual information, respectively. This, in effect, moves ID3's decision algorithm into pairspace, stapling an ensemble approach to it with all of the expected time and space complexities of both methods.

[/Frontmatter]

On to the question: I'm trying to get a continuous integrator working in Python using SciPy. Because I'm working with comparisons of discrete and continuous variables, my current strategy for each comparison for feature-feature pairs is as follows:

Discrete feature versus discrete feature: use the discrete form of mutual information. This results in a double summation of the probabilities, which my code handles without issue.
All other cases (discrete versus continuous, the inverse, and continuous versus continuous): use the continuous form, using a Gaussian estimator to smooth out the probability density functions.
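
For reference, the discrete case above (a double summation over the joint probabilities) can be sketched roughly as follows. This is a minimal plug-in estimate and not the question's actual code: the function name `discrete_mutual_information` is hypothetical, probabilities are assumed to be estimated from raw counts, and the result is in nats.

```python
import numpy as np

def discrete_mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in nats from two equal-length label sequences.

    Hypothetical helper for illustration; probabilities come from raw counts.
    """
    xs = np.asarray(xs).ravel()
    ys = np.asarray(ys).ravel()
    if xs.shape != ys.shape:
        raise ValueError("xs and ys must have the same length")

    # Map arbitrary labels to integer codes 0..k-1.
    _, x_codes = np.unique(xs, return_inverse=True)
    _, y_codes = np.unique(ys, return_inverse=True)

    # Joint count table, normalized into a joint probability table.
    joint = np.zeros((x_codes.max() + 1, y_codes.max() + 1))
    np.add.at(joint, (x_codes, y_codes), 1)
    joint /= joint.sum()

    # Marginals as a column and a row, so px @ py is the product table.
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)

    # Sum only over non-empty cells: 0 * log(0) is taken as 0.
    nz = joint > 0
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))
```

Two identical binary features with balanced labels should give log(2) nats, and two independent ones should give 0.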

It is possible for me to perform some kind of discretization for the latter cases, but because my input data sets are not inherently linear, this is potentially needlessly complex.

Here's the salient code:

import math
import numpy
import scipy
from scipy.stats import gaussian_kde
from scipy.integrate import dblquad

# Constants
MIN_DOUBLE = 4.9406564584124654e-324
                    # The smallest positive (subnormal) Float64; used here to
                    #  prevent the logarithmic function from hitting its
                    #  undefined region at its asymptote of 0.
INF = float('inf')  # The floating-point representation for "infinity"

# x and y are previously defined as collections of 
# floating point values with the same length

# Kernel estimation
gkde_x = gaussian_kde(x)
gkde_y = gaussian_kde(y)

if len(binned_x) != len(binned_y) and len(binned_x) != len(x):
    x.append(x[0])
    y.append(y[0])

gkde_xy = gaussian_kde([x,y])
mutual_info = lambda a,b: gkde_xy([a,b]) * \
           math.log((gkde_xy([a,b]) / (gkde_x(a) * gkde_y(b))) + MIN_DOUBLE)

# Compute MI(X,Y)
(minfo_xy, err_xy) = \
    dblquad(mutual_info, -INF, INF, lambda a: 0, lambda a: INF)

print 'minfo_xy = ', minfo_xy

Note that overcounting exactly one point is done deliberately to prevent a singularity in SciPy's gaussian_kde class. As the size of x and y mutually approach infinity, this effect becomes negligible.

My current snag is getting multiple integration to work against a Gaussian kernel density estimate in SciPy. I've been using SciPy's dblquad to perform the integration, but for the continuous cases I receive an astounding spew of the following messages.

When I set numpy.seterr(all='ignore'):

Warning: The ocurrence of roundoff error is detected, which prevents the requested tolerance from being achieved. The error may be underestimated.

And when I set it to 'call' using an error handler:

Floating point error (underflow), with flag 4

Floating point error (invalid value), with flag 8

Pretty easy to figure out what's going on, right? Well, almost: IEEE 754-2008 and SciPy only tell me what's going on here, not why or how to work around it.

The upshot: minfo_xy generally resolves to nan; its sampling is insufficient to prevent information from becoming lost or invalid when performing Float64 math.
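
A small reproduction of where the nan can come from, as a sketch under assumptions: the sample data below is synthetic, and the `record` handler and `far` evaluation point are hypothetical names chosen for illustration. Far from the data, all three density estimates evaluate to exactly 0.0, so the ratio inside the logarithm becomes 0.0 / 0.0.

```python
import numpy as np
from scipy.stats import gaussian_kde

# Synthetic, correlated sample data (an assumption made for illustration).
rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = x + 0.5 * rng.normal(size=200)

kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)
kde_xy = gaussian_kde(np.vstack([x, y]))

events = []  # (kind, flag) pairs reported by NumPy's error callback

def record(kind, flag):
    events.append((kind, flag))

far = 1e3  # a point far outside the data, as an infinite domain will visit
with np.errstate(all="call", call=record):
    p_xy = kde_xy([[far], [far]])          # joint density is exactly 0.0 here
    p_x, p_y = kde_x([far]), kde_y([far])  # and so are both marginals
    ratio = p_xy / (p_x * p_y)             # 0.0 / 0.0 -> nan ("invalid value")
    # Adding the smallest subnormal to the denominator gives 0.0 instead.
    guarded = p_xy / (p_x * p_y + 5e-324)

print(p_xy, ratio, guarded, events)
```

Once a single evaluation returns nan, the quadrature routine has no way to recover, which is consistent with the whole integral coming back as nan.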

Is there a general workaround for this problem when using SciPy?

Even better: if there is a robust, canned implementation of continuous mutual information for Python with an interface that takes two collections of floating point values or a merged collection of pairs, it would resolve this complete problem. Please link it if you know of one that exists.

Thank you in advance.

Edit: this resolves the nan propagation issue in the example above:

mutual_info = lambda a,b: gkde_xy([a,b]) * \
    math.log((gkde_xy([a,b]) / ((gkde_x(a) * gkde_y(b)) + MIN_DOUBLE)) \
        + MIN_DOUBLE)

However, the question of roundoff correction remains, as does the request for a more robust implementation. Any help in either domain would be greatly appreciated.
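
One possible way around the roundoff warnings is to skip the quadrature entirely. The sketch below swaps in a different technique, a resubstitution estimate that averages the log density ratio over the observed points, so it is not the integral from the code above. The function name `kde_mutual_information` is hypothetical, and the approach assumes that `gaussian_kde.logpdf` is available in the installed SciPy and that the KDE bandwidth defaults are acceptable.

```python
import numpy as np
from scipy.stats import gaussian_kde

def kde_mutual_information(x, y):
    """Resubstitution estimate of I(X;Y) in nats from paired samples.

    Averages log p(x, y) - log p(x) - log p(y) over the sample points,
    working in log space so small densities never enter a division.
    Hypothetical helper; it inherits the bias of the KDE it is built on.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape or x.ndim != 1:
        raise ValueError("x and y must be 1-D and of equal length")

    xy = np.vstack([x, y])                 # shape (2, n), as gaussian_kde expects
    log_pxy = gaussian_kde(xy).logpdf(xy)  # joint log density at each sample
    log_px = gaussian_kde(x).logpdf(x)     # marginal log densities
    log_py = gaussian_kde(y).logpdf(y)
    return float(np.mean(log_pxy - log_px - log_py))
```

Because every density is evaluated at an observed point, none of them can vanish, so the estimate stays finite. The trade-off is that it approximates the expectation by a sample mean instead of integrating the estimated densities.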

asked Dec 02 '11 21:12

MrGomez

1 Answer

Before trying more radical solutions like reframing the problem or using different integration tools, see if this helps. Replace INF = float('inf') with INF = 1E12 or some other large number; that may eliminate NaN results created by simple arithmetic operations on the input variables.

No promises on this one, but it is sometimes helpful to try a quick fix before engaging in a significant algorithmic rewrite or substitution of alternate tools.
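
A variant of the same quick fix, sketched under assumptions: derive the finite bounds from the data instead of picking one large constant, and evaluate the integrand in log space. The function name `kde_mi_bounded` and the `pad` and `tol` parameters are hypothetical, and widening the data range by a couple of standard deviations is a guess at where the estimated density becomes negligible, not a rule.

```python
import numpy as np
from scipy.integrate import dblquad
from scipy.stats import gaussian_kde

def kde_mi_bounded(x, y, pad=2.0, tol=1e-3):
    """Integrate p(x,y) * log(p(x,y) / (p(x) p(y))) over a finite box.

    Hypothetical helper. Returns (value, abserr) as reported by dblquad.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    kde_x, kde_y = gaussian_kde(x), gaussian_kde(y)
    kde_xy = gaussian_kde(np.vstack([x, y]))

    # Finite box: the data range widened by `pad` sample standard deviations.
    x_lo, x_hi = x.min() - pad * x.std(), x.max() + pad * x.std()
    y_lo, y_hi = y.min() - pad * y.std(), y.max() + pad * y.std()

    def integrand(b, a):
        # dblquad passes the inner variable first: b runs over y, a over x.
        log_pxy = kde_xy.logpdf([[a], [b]])[0]
        log_ratio = log_pxy - kde_x.logpdf([a])[0] - kde_y.logpdf([b])[0]
        # exp() may underflow to 0.0 in the corners; 0.0 * finite is still 0.0.
        return np.exp(log_pxy) * log_ratio

    return dblquad(integrand, x_lo, x_hi,
                   lambda a: y_lo, lambda a: y_hi,
                   epsabs=tol, epsrel=tol)
```

Loosening `tol` well above dblquad's default tolerance may also quiet the roundoff warning, since a kernel density estimate is rarely accurate to eight digits in the first place. Expect this to be slow for large samples, because every integrand call evaluates three density estimates.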

answered Sep 20 '22 04:09

Raymond Hettinger

Tags: python, math, floating-point, scipy, information-theory
