Ngram model and perplexity in NLTK

To put my question in context: I would like to train and compare several (neural) language models. To focus on the models rather than on data preparation, I chose the Brown corpus from NLTK, and I want to train the n-gram model that ships with NLTK as a baseline to compare the other language models against.

So my first question is about a behaviour of NLTK's n-gram model that I find suspicious. Since the code is rather short, I have pasted it here:

import nltk

print "... build"
brown = nltk.corpus.brown
corpus = [word.lower() for word in brown.words()]

# Train on 95% of the corpus and test on the rest
spl = 95*len(corpus)/100
train = corpus[:spl]
test = corpus[spl:]

# Remove rare words from the corpus
fdist = nltk.FreqDist(w for w in train)
vocabulary = set(map(lambda x: x[0], filter(lambda x: x[1] >= 5, fdist.iteritems())))

train = map(lambda x: x if x in vocabulary else "*unknown*", train)
test = map(lambda x: x if x in vocabulary else "*unknown*", test)

print "... train"
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) 
lm = NgramModel(5, train, estimator=estimator)

print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) )
print "perplexity(test) =", lm.perplexity(test)

What I find very suspicious is that I get the following results:

... build
... train
len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
perplexity(test) = 4.60298447026

A perplexity of 4.6 would mean that n-gram modeling is extremely good on this corpus. If my interpretation is correct, the model should be able to guess the correct word in roughly 5 tries on average (although there are 13817 possibilities...). Could you share your experience on the value of this perplexity? I don't really believe it. I did not find any complaints about NLTK's n-gram model on the net (but maybe I am doing something wrong). Do you know any good alternatives to NLTK for n-gram models and computing perplexity?
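For reference, the interpretation above can be checked with a few lines that do not depend on NLTK. This is a minimal sketch, assuming perplexity is defined as 2 raised to the average negative log2-probability per token; the helper name `perplexity` is only illustrative and is not NLTK's implementation. A model that knows nothing and spreads probability uniformly over the 13817 vocabulary entries should score a perplexity equal to the vocabulary size, which gives a feel for how far away 4.6 is.

```python
import math

def perplexity(probabilities):
    """Perplexity from the probability the model assigned to each observed token."""
    # Cross-entropy in bits: average negative log2-probability per token.
    entropy = -sum(math.log(p, 2) for p in probabilities) / len(probabilities)
    return 2 ** entropy

# A uniform model over the vocabulary assigns 1/V to every token.
vocab_size = 13817
uniform_ppl = perplexity([1.0 / vocab_size] * 1000)
print("uniform baseline perplexity: %.1f" % uniform_ppl)
```

Any properly normalised model that has learned something should land between 1 and this uniform baseline.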

Thanks!

asked May 12 '13 16:05

zermelozf

1 Answer

You are getting a low perplexity because you are using a pentagram (5-gram) model. If you used a bigram model, your results would be in a more typical range of about 50-1000 (or about 5 to 10 bits).
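The relation between perplexity and bits per word used above is just a base-2 logarithm. A small sketch, with illustrative helper names that are not part of NLTK, makes the conversion explicit for the values mentioned in this thread:

```python
import math

def perplexity_to_bits(ppl):
    """Cross-entropy in bits per word that corresponds to a perplexity."""
    return math.log(ppl, 2)

def bits_to_perplexity(bits):
    """Perplexity that corresponds to a cross-entropy in bits per word."""
    return 2.0 ** bits

# The reported value (4.6) next to the bounds of the range quoted above.
for ppl in (4.6, 50, 1000):
    print("perplexity %7.1f  ->  %.2f bits/word" % (ppl, perplexity_to_bits(ppl)))
```

So 50 to 1000 corresponds to roughly 5.6 to 10 bits per word, while 4.6 is only a little over 2 bits per word.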

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.
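Since the model module is not available there, one way to cross-check a suspicious number is a small hand-written bigram baseline. The following is only a sketch under stated assumptions: it uses plain Python 3 and the standard library, add-gamma (Lidstone) smoothing with gamma = 0.2 as in the question, a toy corpus instead of Brown, and helper names (`train_bigram`, `lidstone_prob`, `bigram_perplexity`) that are hypothetical and not NLTK's API. It also assumes every test token is in the vocabulary, for example after mapping rare words to "*unknown*" as the question does.

```python
import math
from collections import Counter

def train_bigram(tokens):
    """Count contexts (all tokens except the last) and adjacent bigrams."""
    contexts = Counter(tokens[:-1])
    bigrams = Counter(zip(tokens[:-1], tokens[1:]))
    return contexts, bigrams

def lidstone_prob(prev, word, contexts, bigrams, vocab_size, gamma=0.2):
    """P(word | prev) with add-gamma smoothing over a fixed vocabulary."""
    return (bigrams[(prev, word)] + gamma) / (contexts[prev] + gamma * vocab_size)

def bigram_perplexity(test, contexts, bigrams, vocab_size, gamma=0.2):
    """2 ** (average negative log2-probability of each test bigram)."""
    log_sum = 0.0
    for prev, word in zip(test[:-1], test[1:]):
        p = lidstone_prob(prev, word, contexts, bigrams, vocab_size, gamma)
        log_sum += math.log(p, 2)
    return 2 ** (-log_sum / (len(test) - 1))

# Toy data standing in for the train/test split of the question.
train = "the cat sat on the mat the dog sat on the log".split()
test = "the cat sat on the log".split()
vocab = set(train)

contexts, bigrams = train_bigram(train)
ppl = bigram_perplexity(test, contexts, bigrams, len(vocab))
print("bigram perplexity on toy test set: %.2f" % ppl)
```

A useful sanity check on any such model is that the smoothed probabilities for a given context sum to 1 over the vocabulary; if they do not, the reported perplexity cannot be trusted.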

answered Sep 22 '22 06:09

fnl

Tags: python, nltk, n-gram
