
Interpreting scipy.stats.entropy values

I am trying to use scipy.stats.entropy to estimate the Kullback–Leibler (KL) divergence between two distributions. More specifically, I would like to use the KL as a metric to decide how consistent two distributions are.

However, I cannot interpret the KL values. For example:

t1 = numpy.random.normal(-2.5, 0.1, 1000)
t2 = numpy.random.normal(-2.5, 0.1, 1000)
scipy.stats.entropy(t1, t2)
= 0.0015539217193737955

Then,

t1 = numpy.random.normal(-2.5, 0.1, 1000)
t2 = numpy.random.normal(2.5, 0.1, 1000)
scipy.stats.entropy(t1, t2)
= 0.0015908295787942181

How can two completely different distributions with essentially no overlap give nearly the same KL value?

t1 = numpy.random.normal(-2.5, 0.1, 1000)
t2 = numpy.random.normal(25., 0.1, 1000)
scipy.stats.entropy(t1, t2)
= 0.00081111364805590595

This one gives an even smaller KL value (i.e., distance), which I would be inclined to interpret as "more consistent".

Any insights on how to interpret the scipy.stats.entropy (i.e., KL divergence distance) in this context?


People also ask

How to use scipy stats entropy?

Calculate the entropy of a distribution for given probability values. If only probabilities pk are given, the entropy is calculated as S = -sum(pk * log(pk), axis=axis) . If qk is not None, then compute the Kullback-Leibler divergence S = sum(pk * log(pk / qk), axis=axis) .
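For instance, a minimal sketch of both modes of the function (the pk and qk vectors here are made-up illustration values, not from the question):

import numpy as np
from scipy import stats

pk = np.array([0.1, 0.2, 0.4, 0.3])      # a normalized distribution
qk = np.array([0.25, 0.25, 0.25, 0.25])  # a uniform reference distribution

print(stats.entropy(pk))      # Shannon entropy: -sum(pk * log(pk))
print(stats.entropy(pk, qk))  # KL divergence:  sum(pk * log(pk / qk))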

What is Loc in Scipy stats?

The location ( loc ) keyword specifies the mean. The scale ( scale ) keyword specifies the standard deviation. As an instance of the rv_continuous class, norm object inherits from it a collection of generic methods (see below for the full list), and completes them with details specific for this particular distribution.

What does Scipy stats do in Python?

This module contains a large number of probability distributions, summary and frequency statistics, correlation functions and statistical tests, masked statistics, kernel density estimation, quasi-Monte Carlo functionality, and more.

What is scale in Scipy stats?

The loc and scale parameters let you adjust the location and scale of a distribution. For example, to model IQ data, you'd build iq = scipy.stats.norm(loc=100, scale=15) because IQs are constructed so as to have a mean of 100 and a standard deviation of 15.
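A short sketch of that iq example (the printed numbers are simply what these parameters imply for a normal distribution):

from scipy import stats

iq = stats.norm(loc=100, scale=15)   # mean 100, standard deviation 15
print(iq.mean(), iq.std())           # 100.0 15.0
print(iq.cdf(130))                   # P(IQ <= 130), about 0.977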


1 Answer

numpy.random.normal(-2.5,0.1,1000) is a sample from a normal distribution. It's just 1000 numbers in a random order. The documentation for entropy says:

pk[i] is the (possibly unnormalized) probability of event i.

So to get a meaningful result, you need the numbers to be "aligned" so that the same indices correspond to the same positions in the distribution. In your example, t1[0] has no relationship to t2[0]. Your sample doesn't provide any direct information about how probable each value is, which is what you need for the KL divergence; it just gives you some actual values that were drawn from the distribution.
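If all you have are the two samples, one common workaround (a sketch of a standard binning approach, not something this answer proposes) is to histogram both samples on the same bin edges so that index i refers to the same region of the x-axis for both, and then pass the bin counts to entropy, which normalizes them itself. The bin edges and the small epsilon below are arbitrary illustration choices; the cleaner route, used next, is to work with the true PDFs directly.

import numpy as np
from scipy import stats

t1 = np.random.normal(-2.5, 0.1, 1000)
t2 = np.random.normal(2.5, 0.1, 1000)

# shared bin edges so p[i] and q[i] refer to the same interval
bins = np.linspace(-5, 5, 101)
p, _ = np.histogram(t1, bins=bins)
q, _ = np.histogram(t2, bins=bins)

# a small epsilon avoids log(0) and division by zero in empty bins
eps = 1e-10
print(stats.entropy(p + eps, q + eps))  # large for these barely overlapping samples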

The most straightforward way to get aligned values is to evaluate both distributions' probability density functions at the same fixed set of values. To do this, you need to use scipy.stats.norm (which returns a distribution object that can be manipulated in various ways) instead of np.random.normal (which only returns sampled values). Here's an example:

import numpy as np
from scipy import stats

t1 = stats.norm(-2.5, 0.1)
t2 = stats.norm(-2.5, 0.1)
t3 = stats.norm(-2.4, 0.1)
t4 = stats.norm(-2.3, 0.1)

# domain to evaluate the PDFs on
x = np.linspace(-5, 5, 100)

Then:

>>> stats.entropy(t1.pdf(x), t2.pdf(x))
-0.0
>>> stats.entropy(t1.pdf(x), t3.pdf(x))
0.49999995020647586
>>> stats.entropy(t1.pdf(x), t4.pdf(x))
1.999999900414918

You can see that as the distributions move further apart, their KL divergence increases. (In fact, using your second example will give a KL divergence of inf because they overlap so little.)
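As a cross-check (the closed form below is standard for two Gaussians, not part of the answer itself), the analytic KL divergence log(sigma2/sigma1) + (sigma1^2 + (mu1 - mu2)^2) / (2 * sigma2^2) - 1/2 reproduces the numbers above:

import numpy as np

def gaussian_kl(mu1, sigma1, mu2, sigma2):
    # KL( N(mu1, sigma1^2) || N(mu2, sigma2^2) ), natural log
    return (np.log(sigma2 / sigma1)
            + (sigma1**2 + (mu1 - mu2)**2) / (2 * sigma2**2)
            - 0.5)

print(gaussian_kl(-2.5, 0.1, -2.4, 0.1))  # 0.5,   matches entropy(t1.pdf(x), t3.pdf(x))
print(gaussian_kl(-2.5, 0.1, -2.3, 0.1))  # 2.0,   matches entropy(t1.pdf(x), t4.pdf(x))
print(gaussian_kl(-2.5, 0.1,  2.5, 0.1))  # 1250.0, so large the sampled-PDF version comes out as inf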
