 

Computation of Kullback-Leibler (KL) distance between text-documents using numpy

My goal is to compute the KL distance between the following text documents:

1) The boy is having a lad relationship
2) The boy is having a boy relationship
3) It is a lovely day in NY

First of all, I vectorised the documents so that I could easily apply numpy:

1)[1,1,1,1,1,1,1]
2)[1,2,1,1,1,2,1]
3)[1,1,1,1,1,1,1]

I then applied the following code for computing KL distance between the texts:

import numpy as np
import math
from math import log

v=[[1,1,1,1,1,1,1],[1,2,1,1,1,2,1],[1,1,1,1,1,1,1]]
c=v[0]
def kl(p, q):
    p = np.asarray(p, dtype=np.float)
    q = np.asarray(q, dtype=np.float)
    return np.sum(np.where(p != 0,(p-q) * np.log10(p / q), 0))
for x in v:
    KL=kl(x,c)
    print KL

Here is the result of the above code: [0.0, 0.602059991328, 0.0]. Texts 1 and 3 are completely different, but the distance between them is 0, while texts 1 and 2, which are highly related, have a distance of 0.602059991328. This isn't accurate.

Does anyone have an idea of what I'm not doing right with regard to KL? Many thanks for your suggestions.

asked Aug 22 '13 by Tiger1


People also ask

How do you calculate KL divergence between two distributions in Python?

KL divergence can be calculated as the negative sum of the probability of each event in P multiplied by the log of the probability of that event in Q over the probability of the event in P. The value within the sum is the divergence for a given event.
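
For illustration only, a direct numpy translation of that description might look like the sketch below; the two distributions are made up rather than taken from the question.

import numpy as np

def kl_divergence(p, q):
    # D_KL(P || Q) = -sum(p * log(q / p)), equivalently sum(p * log(p / q)), in nats
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    return -np.sum(p * np.log(q / p))

# Two made-up discrete distributions over the same three events
p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(kl_divergence(p, q))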

Can KL divergence be used as a distance measure?

Although the KL divergence measures the "distance" between two distributions, it is not a distance measure, because it is not a metric. In particular, it is not symmetric: the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).
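
A quick way to see the asymmetry, again with made-up numbers rather than anything from the question:

import numpy as np

def kl(p, q):
    # D_KL(p || q) in nats
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sum(p * np.log(p / q))

p = [0.5, 0.4, 0.1]
q = [0.1, 0.3, 0.6]
print(kl(p, q))  # KL from p to q
print(kl(q, p))  # KL from q to p, generally a different number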

What is the KL divergence between two equal distributions?

Intuitively, this measures how far a given arbitrary distribution is from the true distribution. If two distributions match perfectly, D_{KL}(p||q) = 0; otherwise it can take values between 0 and ∞. The lower the KL divergence value, the better we have matched the true distribution with our approximation.

Why do we use KL divergence?

As we've seen, we can use KL divergence to minimize how much information loss we have when approximating a distribution. Combining KL divergence with neural networks allows us to learn very complex approximating distributions for our data.


2 Answers

Though I hate to add another answer, there are two points here. First, as Jaime pointed out in the comments, KL divergence (or distance; according to the documentation below, they are the same) is designed to measure the difference between probability distributions. This basically means that what you pass to the function should be two array-likes, the elements of each of which sum to 1.

Second, scipy apparently does implement this, with a naming scheme more related to the field of information theory. The function is "entropy":

scipy.stats.entropy(pk, qk=None, base=None)

http://docs.scipy.org/doc/scipy-dev/reference/generated/scipy.stats.entropy.html

From the docs:

If qk is not None, then compute a relative entropy (also known as Kullback-Leibler divergence or Kullback-Leibler distance) S = sum(pk * log(pk / qk), axis=0).

A bonus of this function is that it will normalize the vectors you pass it if they do not sum to 1 (though this means you have to be careful with the arrays you pass, i.e. how they are constructed from data).
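
As a rough sketch of how that looks with the count vectors for texts 1 and 2 from the question, relying on entropy's built-in normalization:

from scipy.stats import entropy

p = [1, 1, 1, 1, 1, 1, 1]  # counts for text 1
q = [1, 2, 1, 1, 1, 2, 1]  # counts for text 2

# entropy(pk, qk) normalizes both inputs to sum to 1, then computes
# sum(pk * log(pk / qk)), i.e. the KL divergence, in nats by default
print(entropy(p, q))
print(entropy(q, p))          # asymmetric, so the reverse differs
print(entropy(p, q, base=2))  # the same divergence expressed in bits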

Hope this helps; at least a library provides it, so you don't have to code your own.

answered Oct 23 '22 by dpb


After a bit of googling to understand the KL concept, I think your problem is due to the vectorization: you're comparing the number of appearances of different words. You should either link each column index to one word, or use a dictionary:

#  The boy is having a lad relationship It lovely day in NY
1)[1   1   1  1      1 1   1            0  0      0   0  0]
2)[1   2   1  1      1 0   1            0  0      0   0  0]
3)[0   0   1  0      1 0   0            1  1      1   1  1]

Then you can use your kl function.

To automatically vectorize to a dictionary, see How to count the frequency of the elements in a list? (collections.Counter is exactly what you need). Then you can loop over the union of the keys of the dictionaries to compute the KL distance, as in the sketch below.
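
Here is a minimal sketch of that approach, assuming simple whitespace tokenization and reusing the question's log10-based formula; note that when one text contains words the other lacks, the divergence becomes infinite unless you add smoothing:

from collections import Counter
import numpy as np

docs = [
    "The boy is having a lad relationship",
    "The boy is having a boy relationship",
    "It is a lovely day in NY",
]

# Count words per document and build vectors over the shared vocabulary
counts = [Counter(doc.split()) for doc in docs]
vocab = sorted(set().union(*counts))
vectors = [np.array([c[word] for word in vocab], dtype=float) for c in counts]

def kl(p, q):
    # Normalize the count vectors to probability distributions, then
    # sum p * log10(p / q) over the entries where p is non-zero.
    # Where q is zero but p is not, the result is inf, which is what
    # two texts with disjoint vocabularies produce without smoothing.
    p = p / p.sum()
    q = q / q.sum()
    mask = p > 0
    with np.errstate(divide="ignore"):
        return np.sum(p[mask] * np.log10(p[mask] / q[mask]))

c = vectors[0]
for v in vectors:
    print(kl(v, c))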

answered Oct 23 '22 by J. Martinot-Lagarde