I have followed the paper here and the code here (it implements the symmetric KLD with the back-off model proposed in the paper in the first link) for computing the KL divergence between two text data sets. I have changed the for-loop at the end to return the probability distributions of the two data sets, so I can test whether both sum to 1:
import re, math, collections

def tokenize(_str):
    stopwords = ['and', 'for', 'if', 'the', 'then', 'be', 'is', \
                 'are', 'will', 'in', 'it', 'to', 'that']
    tokens = collections.defaultdict(lambda: 0.)

    for m in re.finditer(r"(\w+)", _str, re.UNICODE):
        m = m.group(1).lower()
        if len(m) < 2: continue
        if m in stopwords: continue
        tokens[m] += 1

    return tokens
#end of tokenize
def kldiv(_s, _t):
    if (len(_s) == 0):
        return 1e33
    if (len(_t) == 0):
        return 1e33

    ssum = 0. + sum(_s.values())
    slen = len(_s)

    tsum = 0. + sum(_t.values())
    tlen = len(_t)

    vocabdiff = set(_s.keys()).difference(set(_t.keys()))
    lenvocabdiff = len(vocabdiff)

    """ epsilon """
    epsilon = min(min(_s.values())/ssum, min(_t.values())/tsum) * 0.001

    """ gamma """
    gamma = 1 - lenvocabdiff * epsilon

    """ Check if distribution probabilities sum to 1"""
    sc = sum([v/ssum for v in _s.values()])
    st = sum([v/tsum for v in _t.values()])

    ps = []
    pt = []
    for t, v in _s.items():
        pts = v / ssum
        ptt = epsilon
        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        ps.append(pts)
        pt.append(ptt)

    return ps, pt
I have tested it with:
d1 = """Many research publications want you to use BibTeX, which better
organizes the whole process. Suppose for concreteness your source
file is x.tex. Basically, you create a file x.bib containing the
bibliography, and run bibtex on that file."""
d2 = """In this case you must supply both a \left and a \right because the
delimiter height are made to match whatever is contained between the
two commands. But, the \left doesn't have to be an actual 'left
delimiter', that is you can use '\left)' if there were some reason
to do it."""
sum(ps) is 1, but sum(pt) is way smaller than 1. Is there something incorrect in the code, or is it something else? Thanks!
Update:
In order to make both pt and ps sum to 1, I had to change the code to:
    # requires: from collections import Counter (or use collections.Counter)
    vocab = Counter(_s) + Counter(_t)

    ps = []
    pt = []
    for t, v in vocab.items():
        if t in _s:
            pts = gamma * (_s[t] / ssum)
        else:
            pts = epsilon

        if t in _t:
            ptt = gamma * (_t[t] / tsum)
        else:
            ptt = epsilon

        ps.append(pts)
        pt.append(ptt)

    return ps, pt
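Re-running the same check with the updated loop (same d1 and d2 as above):

ps, pt = kldiv(tokenize(d1), tokenize(d2))
print(sum(ps), sum(pt))   # both are now (approximately) 1.0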
KL divergence can be calculated as the negative sum, over all events, of the probability of each event in P multiplied by the log of that event's probability in Q over its probability in P. The value inside the sum is the divergence contributed by a single event.
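As a quick plain-Python sketch of that formula (a toy example, not the back-off-smoothed version in the question's code):

import math

def kl_divergence(p, q):
    # D_KL(P || Q) = -sum_i p_i * log(q_i / p_i) = sum_i p_i * log(p_i / q_i)
    # Terms with p_i == 0 contribute nothing; q_i is assumed non-zero wherever p_i > 0.
    return -sum(pi * math.log(qi / pi) for pi, qi in zip(p, q) if pi > 0)

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(kl_divergence(p, q))   # about 1.336 (nats)
print(kl_divergence(q, p))   # about 1.401 -- a different value

The two directions giving different values is exactly the asymmetry discussed next.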
Although the KL divergence measures the "distance" between two distributions, it is not a distance measure, because it is not a metric: it is not symmetric, so the KL from p(x) to q(x) is generally not the same as the KL from q(x) to p(x).
Intuitively, it measures how far a given arbitrary distribution is from the true distribution. If the two distributions match perfectly, D_{KL}(p||q) = 0; otherwise it can take values between 0 and ∞. The lower the KL divergence, the better we have matched the true distribution with our approximation.
In that sense, the Kullback–Leibler (KL) divergence is the difference between cross-entropy and entropy: the KL divergence of two distributions p and q is the cross-entropy of the two distributions, H(p,q), minus the entropy of the first distribution, H(p).
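A quick numerical check of that identity, reusing the toy p and q from above:

import math

def entropy(p):
    # H(p) = -sum_i p_i * log(p_i)
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    # H(p, q) = -sum_i p_i * log(q_i)
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.10, 0.40, 0.50]
q = [0.80, 0.15, 0.05]
print(cross_entropy(p, q) - entropy(p))   # equals kl_divergence(p, q), about 1.336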
Both sum(ps) and sum(pt) are the total probability mass of _s and _t over the support of s (by "support of s" I mean all words that appear in _s, regardless of the words that appear in _t). This means that sum(ps) is always 1, while sum(pt) leaves out whatever probability mass _t puts on words that never appear in _s, so it can be much smaller than 1.
So, I don't think there's a problem with the code.
Also, contrary to the title of the question, kldiv() does not compute the symmetric KL-divergence, but the KL-divergence between _s and a smoothed version of _t.
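If a symmetric value is what you're after, one common option (this is not what the posted code does) is to evaluate the divergence in both directions over the same support and sum (or average) the two, e.g.:

import math

def kl(p, q):
    # one direction, over two aligned probability lists (same vocabulary order)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def symmetric_kl(p, q):
    # Jeffreys-style symmetrization: KL(p||q) + KL(q||p)
    return kl(p, q) + kl(q, p)

This assumes p and q are built over a shared vocabulary (as in the Update's combined-vocab loop), so the epsilon back-off keeps every q_i non-zero.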