
What does "word count" refer to when calculating unigram probabilities in a unigram language model?

Tags:

nlp

I'm using a unigram language model. I want to calculate the probability of each unigram. Should I divide the number of occurrences of a unigram by the number of distinct unigrams, or by the count of all unigrams?

vikifor asked Apr 25 '13 22:04



People also ask

How is unigram probability calculated?

Training an n-gram model is easy. To estimate the probabilities of a unigram language model, just count the number of times each word occurs and divide it by the total number of words: $P(w) = \frac{\text{count}(w)}{\text{total number of words}}$.

How do you find the probability of a bigram?

The bigram probability of, for example, “prime minister” is calculated by dividing the number of times the string “prime minister” appears in the given corpus by the total number of times the word “prime” appears in the same corpus.
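
As a rough illustration of that ratio, here is a minimal Python sketch. The function name `bigram_probability` and the toy sentence are made up for this example, and tokenization is assumed to be a plain whitespace split.

```python
from collections import Counter

def bigram_probability(tokens, first, second):
    """Estimate P(second | first) as count(first second) / count(first)."""
    unigram_counts = Counter(tokens)
    # Pair each token with its successor to count adjacent word pairs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[first] == 0:
        return 0.0  # the first word never occurs, so no estimate is available
    return bigram_counts[(first, second)] / unigram_counts[first]

# Hypothetical toy corpus, tokenized by whitespace.
tokens = "the prime minister met the prime suspect".split()
p = bigram_probability(tokens, "prime", "minister")
print(p)
```

Here "prime" occurs twice and is followed by "minister" once, so the estimate is one half.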


1 Answer

Divide by the total number of tokens, i.e. word occurrences, in the training set. The reason is easy to see: if you divide by the number of distinct words, the probabilities of all words will generally not sum to one, so they won't form a probability distribution. Dividing by the token count guarantees that they do, because the individual word counts add up to exactly the number of tokens.
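
The difference can be checked on a small example. The following is a minimal sketch, assuming a made-up toy corpus split on whitespace; `unigram_probabilities` is a name chosen for illustration, not a library function.

```python
from collections import Counter

def unigram_probabilities(tokens):
    """Maximum-likelihood estimates: count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)  # all word occurrences, not distinct words
    return {word: count / total for word, count in counts.items()}

# Hypothetical toy corpus: 8 tokens, 6 distinct words.
tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)

probs = unigram_probabilities(tokens)
sum_by_tokens = sum(probs.values())

# Dividing by the number of distinct words instead (the incorrect choice).
vocab_size = len(counts)
sum_by_vocab = sum(count / vocab_size for count in counts.values())

print(probs["the"], sum_by_tokens, sum_by_vocab)
```

With the token count as the denominator the values sum to one. With the vocabulary size as the denominator they sum to 8/6, which is not a valid probability distribution.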

Fred Foo answered Oct 03 '22 05:10
