Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ngram model and perplexity in NLTK

To put my question in context, I would like to train and test/compare several (neural) language models. In order to focus on the models rather than data preparation I chose to use the Brown corpus from nltk and train the Ngrams model provided with the nltk as a baseline (to compare other LM against).

So my first question is actually about a behaviour of the Ngram model of nltk that I find suspicious. Since the code is rather short I pasted it here:

import nltk

print "... build"
brown = nltk.corpus.brown
corpus = [word.lower() for word in brown.words()]

# Train on 95% f the corpus and test on the rest
spl = 95*len(corpus)/100
train = corpus[:spl]
test = corpus[spl:]

# Remove rare words from the corpus
fdist = nltk.FreqDist(w for w in train)
vocabulary = set(map(lambda x: x[0], filter(lambda x: x[1] >= 5, fdist.iteritems())))

train = map(lambda x: x if x in vocabulary else "*unknown*", train)
test = map(lambda x: x if x in vocabulary else "*unknown*", test)

print "... train"
from nltk.model import NgramModel
from nltk.probability import LidstoneProbDist

estimator = lambda fdist, bins: LidstoneProbDist(fdist, 0.2) 
lm = NgramModel(5, train, estimator=estimator)

print "len(corpus) = %s, len(vocabulary) = %s, len(train) = %s, len(test) = %s" % ( len(corpus), len(vocabulary), len(train), len(test) )
print "perplexity(test) =", lm.perplexity(test)

What I find very suspicious is that I get the following results:

... build
... train
len(corpus) = 1161192, len(vocabulary) = 13817, len(train) = 1103132, len(test) = 58060
perplexity(test) = 4.60298447026

With a perplexity of 4.6 it seems Ngram modeling is very good on that corpus. If my interpretation is correct then the model should be able to guess the correct word in roughly 5 tries on average (although there are 13817 possibilities...). If you could share your experience on the value of this perplexity (I don't really believe it)? I did not find any complaints on the ngram model of nltk on the net ( but maybe I do it wrong). Do you know a good alternatives to NLTK for Ngram models and computing perplexity?

Thanks!

like image 462
zermelozf Avatar asked May 12 '13 16:05

zermelozf


People also ask

What is ngram NLTK?

first of all lets understand what is Ngram so it means the sequence of N words, for e.g "A mango" is a 2-gram, "the cat is dancing" is 4-gram and many more. Build a Chatbot in Python from Scratch!

What is ngram in Python?

N-grams are continuous sequences of words or symbols or tokens in a document. In technical terms, they can be defined as the neighbouring sequences of items in a document. They come into play when we deal with text data in NLP(Natural Language Processing) tasks.

What are n gram language models?

N-gram Language Model: An N-gram language model predicts the probability of a given N-gram within any sequence of words in the language. A good N-gram model can predict the next word in the sentence i.e the value of p(w|h)

What is Bigram model?

The Bigram Model As the name suggests, the bigram model approximates the probability of a word given all the previous words by using only the conditional probability of one preceding word. In other words, you approximate it with the probability: P(the | that)


1 Answers

You are getting a low perplexity because you are using a pentagram model. If you'd use a bigram model your results will be in more regular ranges of about 50-1000 (or about 5 to 10 bits).

Given your comments, are you using NLTK-3.0alpha? You shouldn't, at least not for language modeling:

https://github.com/nltk/nltk/issues?labels=model

As a matter of fact, the whole model module has been dropped from the NLTK-3.0a4 pre-release until the issues are fixed.

like image 73
fnl Avatar answered Sep 22 '22 06:09

fnl