NLTK package to estimate the (unigram) perplexity

I am trying to calculate the perplexity for the data I have. The code I am using is:

import sys
sys.path.append("/usr/local/anaconda/lib/python2.7/site-packages/nltk")

from nltk.corpus import brown
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist, WittenBellProbDist
estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2)
lm = NgramModel(3, brown.words(categories='news'), True, False, estimator)
print lm

But I am receiving this error:

File "/usr/local/anaconda/lib/python2.7/site-packages/nltk/model/ngram.py", line 107, in __init__
cfd[context][token] += 1
TypeError: 'int' object has no attribute '__getitem__'

I have already run Latent Dirichlet Allocation on my data and generated the unigrams with their respective probabilities (they are normalized, so the probabilities sum to 1).

My unigrams and their probabilities look like this:

Negroponte 1.22948976891e-05
Andreas 7.11290670484e-07
Rheinberg 7.08255885794e-07
Joji 4.48481435106e-07
Helguson 1.89936727391e-07
CAPTION_spot 2.37395965468e-06
Mortimer 1.48540253778e-07
yellow 1.26582575863e-05
Sugar 1.49563800878e-06
four 0.000207196011781

This is just a fragment of my unigrams file. The same format continues for thousands of lines, and the probabilities in the second column sum to 1.

I am a budding programmer. The file ngram.py belongs to the nltk package, and I am not sure how to fix this error. The sample code above is from the nltk documentation, and I don't know what to do next. Please suggest what I can do. Thanks in advance!

asked Oct 21 '15 18:10

Ana_Sam

1 Answer

Perplexity is the inverse probability of the test set, normalized by the number of words. In the case of unigrams:

$$PP(W) = P(w_1 w_2 \dots w_N)^{-\frac{1}{N}} = \sqrt[N]{\prod_{i=1}^{N} \frac{1}{P(w_i)}}$$

You say you have already constructed the unigram model, meaning that you have a probability for each word. Then you only need to apply the formula. I assume you have a large dictionary, unigram[word], that gives the probability of each word in the corpus. You also need a test set. If your unigram model is not stored as a dictionary, tell me which data structure you used so I can adapt my solution accordingly.
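If the unigrams live in a text file like the fragment in the question, the dictionary could be built roughly as follows. This is only a sketch: it assumes one whitespace-separated `word probability` pair per line, and the names `parse_unigrams` and `load_unigrams` are made up for illustration. It is written to run on both Python 2.7 and Python 3.

```python
import io

def parse_unigrams(lines):
    """Turn 'word probability' lines into a dict (hypothetical helper)."""
    unigram = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        word, prob = parts
        unigram[word] = float(prob)
    return unigram

def load_unigrams(path):
    """Read the unigram file at `path`; UTF-8 encoding is an assumption."""
    with io.open(path, encoding="utf-8") as handle:
        return parse_unigrams(handle)

# Two lines taken from the fragment in the question
sample = ["yellow 1.26582575863e-05", "four 0.000207196011781"]
unigram_sample = parse_unigrams(sample)
```

With the model in that form, the formula can be applied directly: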

perplexity = 1
N = 0

for word in testset:
    # only words the model knows contribute to the score
    if word in unigram:
        N += 1
        perplexity = perplexity * (1/unigram[word])
# take the N-th root of the product
perplexity = pow(perplexity, 1/float(N))
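One practical caveat: with probabilities around 1e-6, as in the question, the running product of 1/p overflows to infinity after a few dozen words. A common workaround is to sum log probabilities and exponentiate once at the end. The sketch below does that; the function name `unigram_perplexity` is hypothetical, and, like the loop above, it skips test words the model does not know.

```python
import math

def unigram_perplexity(testset, unigram):
    """Perplexity over the test words found in `unigram`, computed in log space."""
    log_sum = 0.0
    n = 0
    for word in testset:
        if word in unigram:
            log_sum += math.log(unigram[word])
            n += 1
    if n == 0:
        raise ValueError("no test word is covered by the model")
    # exp(-(1/N) * sum(log p)) equals the N-th root of the product of 1/p
    return math.exp(-log_sum / n)
```

Mathematically this gives the same value as the product form; it only avoids the intermediate overflow.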

UPDATE:

As you asked for a complete working example, here's a very simple one.

Suppose this is our corpus:

corpus ="""
Monty Python (sometimes known as The Pythons) were a British surreal comedy group who created the sketch comedy show Monty Python's Flying Circus,
that first aired on the BBC on October 5, 1969. Forty-five episodes were made over four series. The Python phenomenon developed from the television series
into something larger in scope and impact, spawning touring stage shows, films, numerous albums, several books, and a stage musical.
The group's influence on comedy has been compared to The Beatles' influence on music."""

Here's how we construct the unigram model first:

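import collections, nltk
# we first tokenize the text corpus
tokens = nltk.word_tokenize(corpus)

# here we construct the unigram language model
def unigram(tokens):
    # unseen words fall back to 0.01; note that the same default also seeds
    # every count, so a word seen k times is stored as k + 0.01
    model = collections.defaultdict(lambda: 0.01)
    for f in tokens:
        # a defaultdict never raises KeyError, so no try/except is needed
        model[f] += 1
    # turn the counts into relative frequencies
    N = float(sum(model.values()))
    for word in model:
        model[word] = model[word]/N
    return model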

Our model here is crudely smoothed: any word it has not seen is assigned a fixed low probability of 0.01 (this is a simple fallback, so the probabilities no longer sum exactly to 1). Perplexity is then computed as described above:

#computes perplexity of the unigram model on a testset  
def perplexity(testset, model):
    testset = testset.split()
    perplexity = 1
    N = 0
    for word in testset:
        N += 1
        perplexity = perplexity * (1/model[word])
    perplexity = pow(perplexity, 1/float(N)) 
    return perplexity

Now we can test this on two different test sets:

testset1 = "Monty"
testset2 = "abracadabra gobbledygook rubbish"

model = unigram(tokens)
print perplexity(testset1, model)
print perplexity(testset2, model)

for which you get the following result:

>>> 
49.09452736318415
99.99999999999997
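
As a quick sanity check on the second number: each of the three unseen words in testset2 receives the fallback probability 0.01, so the formula gives

```latex
PP = \left(\frac{1}{0.01}\cdot\frac{1}{0.01}\cdot\frac{1}{0.01}\right)^{1/3}
   = \left(10^{6}\right)^{1/3}
   = 100
```

The printed 99.99999999999997 differs from 100 only by floating-point rounding.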

Note that lower perplexity is better: a language model with lower perplexity on a given test set is preferable to one with higher perplexity. In the first test set, the word Monty is part of the unigram model, so the perplexity is correspondingly lower.
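
The fixed 0.01 fallback keeps the example short, but it does not yield a distribution that sums to 1. If that matters, one standard alternative is add-one (Laplace) smoothing. The sketch below is an illustration, not the method used above: the name `laplace_unigram` is hypothetical, and lumping all unseen words into a single reserved slot is a simplifying assumption. The perplexity values it produces will differ from the output shown earlier.

```python
import collections

def laplace_unigram(tokens):
    """Add-one smoothed unigram probabilities with one reserved unseen-word slot."""
    counts = collections.Counter(tokens)
    vocab_size = len(counts) + 1               # +1 for the unseen-word slot
    total = sum(counts.values()) + vocab_size  # every slot gets one extra count
    probs = {word: (count + 1.0) / total for word, count in counts.items()}
    unk_prob = 1.0 / total                     # probability shared by all unseen words
    return probs, unk_prob

probs, unk_prob = laplace_unigram("the cat sat on the mat".split())
```

When scoring a test set, `probs.get(word, unk_prob)` then plays the role of `model[word]`.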

answered Sep 25 '22 01:09

Omid

Tags: python-2.7, nlp, nltk, n-gram, language-model
