Computing symmetric Kullback-Leibler divergence between two documents

I have followed the paper here and the code here (it implements the symmetric KLD with the back-off model proposed in the paper in the first link) for computing the KLD between two text data sets. I have changed the for-loop at the end so that it returns the two probability distributions, in order to test whether both sum to 1:

import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)
    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize

def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33

    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.itervalues()])
    st = sum([v/tsum for v in _t.itervalues()])

    ps = []
    pt = []
    for t, v in _s.items():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)
    return ps, pt

I have tested with:

d1 = """Many research publications want you to use BibTeX, which better organizes the whole process. Suppose for concreteness your source file is x.tex. Basically, you create a file x.bib containing the bibliography, and run bibtex on that file."""

d2 = """In this case you must supply both a \left and a \right because the delimiter height are made to match whatever is contained between the two commands. But, the \left doesn't have to be an actual 'left delimiter', that is you can use '\left)' if there were some reason to do it."""

With these inputs, sum(ps) = 1 but sum(pt) is much smaller than 1. As far as I understand, both should sum to 1.

Is there something incorrect in the code, or somewhere else? Thanks!

Update:

In order to make both pt and ps sum to 1, I had to change the code to:

    vocab = collections.Counter(_s) + collections.Counter(_t)
    ps = []
    pt = []
    for t, v in vocab.items():
        if t in _s:
            pts = gamma * (_s[t] / ssum)
        else:
            pts = epsilon

        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        else:
            ptt = epsilon

        ps.append(pts)
        pt.append(ptt)

    return ps, pt
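
With ps and pt aligned over the same vocabulary, the symmetric divergence itself can then be computed. One possible formulation is the sum of the two directed divergences (some papers average them instead), sketched here:

import math

def symmetric_kld(ps, pt):
    # D(P||Q) + D(Q||P) over the aligned, smoothed distributions
    d_st = sum(p * math.log(p / q) for p, q in zip(ps, pt))
    d_ts = sum(q * math.log(q / p) for p, q in zip(ps, pt))
    return d_st + d_ts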
asked Feb 18 '16 by Blue482
People also ask

How do you calculate Kullback-Leibler divergence?

KL divergence can be calculated as the negative sum of probability of each event in P multiplied by the log of the probability of the event in Q over the probability of the event in P. The value within the sum is the divergence for a given event.
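
In code, with two made-up three-event distributions (the numbers are purely illustrative), this looks like:

import math

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

# -sum(p * log(q / p)) is the same as sum(p * log(p / q))
kl_pq = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
print(kl_pq)  # a single non-negative number; each term is the divergence for one event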

Is Kullback-Leibler divergence symmetric?

Although the KL divergence measures the “distance” between two distributions, it is not a distance measure. This is because the KL divergence is not a metric: it is not symmetric, so the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).
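
Reusing the toy distributions from the previous snippet makes the asymmetry easy to check:

import math

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

def kl(a, b):
    return sum(x * math.log(x / y) for x, y in zip(a, b))

print(kl(p, q))  # D(P||Q)
print(kl(q, p))  # D(Q||P) -- a different value, so KL is not symmetric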

How do you interpret Kullback-Leibler divergence?

Intuitively, it measures how far a given arbitrary distribution is from the true distribution. If two distributions match perfectly, D_{KL}(p||q) = 0; otherwise it can take values between 0 and ∞. The lower the KL divergence value, the better we have matched the true distribution with our approximation.

Is Kullback-Leibler divergence related to cross-entropy?

Kullback–Leibler divergence (KL divergence) is the difference between cross-entropy and entropy: the KL divergence of two distributions p and q is the cross-entropy of the two distributions, H(p, q), minus the entropy of the first distribution, H(p).
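
The identity D_KL(p||q) = H(p, q) - H(p) can be checked numerically with the same kind of toy distributions:

import math

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]

cross_entropy = -sum(pi * math.log(qi) for pi, qi in zip(p, q))  # H(p, q)
entropy = -sum(pi * math.log(pi) for pi in p)                    # H(p)
kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))         # D_KL(p||q)

print(abs((cross_entropy - entropy) - kl) < 1e-12)  # True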


1 Answer

Both sum(ps) and sum(pt) are the total probability mass of _s and _t over the support of s (by "support of s" I mean all words that appear in _s, regardless of the words that appear in _t). This means that

  1. sum(ps)==1, since the for-loop sums over all words in _s.
  2. sum(pt) <= 1, where equality will hold if the support of t is a subset of the support of s (that is, if all words in _t appear in _s). Also, sum(pt) might be close to 0 if the overlap between words in _s and _t is small. Specifically, if the intersection of _s and _t is the empty set, then sum(pt) == epsilon*len(_s).

So, I don't think there's a problem with the code.
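
To see this concretely, here is a small sketch with two made-up token-count dicts that share only one word, fed to the kldiv from the question:

# Hypothetical counts, chosen so the documents overlap in a single word.
s = {'apple': 2.0, 'banana': 1.0, 'cherry': 1.0}
t = {'apple': 1.0, 'durian': 3.0}

ps, pt = kldiv(s, t)
print(sum(ps))  # 1.0 -- the loop covers every word of s
print(sum(pt))  # about 0.25 -- only 'apple' carries real mass; the rest is epsilon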

Also, contrary to the title of the question, kldiv() does not compute the symmetric KL-divergence, but the KL-divergence between _s and a smoothed version of _t.

answered by Tomer Levinboim