I'm using a unigram language model and want to calculate the probability of each unigram. Should I divide the number of occurrences of a unigram by the number of distinct unigrams, or by the count of all unigrams?
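Training an n-gram model is easy. To estimate the probabilities of a unigram language model, just count the number of times each word occurs and divide it by the total number of words: $P(w) = \frac{c(w)}{N}$, where $c(w)$ is the number of times the word $w$ occurs and $N$ is the total number of words.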
For a bigram such as “prime minister”, the probability of “minister” following “prime” is calculated by dividing the number of times the string “prime minister” appears in the given corpus by the total number of times the word “prime” appears in the same corpus.
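The bigram estimate described above can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `bigram_probability` and the toy corpus are made up for the example, and tokenisation is a plain whitespace split.

```python
from collections import Counter


def bigram_probability(tokens, first, second):
    """Estimate P(second | first) as count(first second) / count(first)."""
    unigram_counts = Counter(tokens)
    # Pair each token with its successor to get all adjacent word pairs.
    bigram_counts = Counter(zip(tokens, tokens[1:]))
    if unigram_counts[first] == 0:
        raise ValueError(f"{first!r} does not occur in the corpus")
    return bigram_counts[(first, second)] / unigram_counts[first]


# Hypothetical toy corpus: "prime" occurs 3 times, "prime minister" twice.
tokens = "the prime minister met the prime suspect and the prime minister left".split()
p = bigram_probability(tokens, "prime", "minister")
print(p)  # 2 / 3
```

Note that the denominator here is the plain count of "prime", as in the description above. If "prime" happened to be the very last token of the corpus, it would be counted without starting any bigram, so some implementations pad the text with an end marker.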
Divide by the total number of tokens, i.e. word occurrences, in the training set. The reason is quite easy to see: if you divide by the number of distinct words, the probabilities for all words will not necessarily sum to one so they won't form a probability distribution.
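A small Python sketch makes the difference between the two denominators concrete. The helper name `unigram_probabilities` and the example sentence are assumptions chosen for illustration, not part of any library.

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Relative frequency of each word: count(word) / total number of tokens."""
    counts = Counter(tokens)
    total = len(tokens)  # all word occurrences, not distinct words
    return {word: count / total for word, count in counts.items()}


# Hypothetical toy corpus: 8 tokens, 6 distinct words.
tokens = "the cat sat on the mat the end".split()
probs = unigram_probabilities(tokens)

# Dividing by the number of distinct words instead, for comparison only.
counts = Counter(tokens)
wrong = {word: count / len(counts) for word, count in counts.items()}

print(sum(probs.values()))  # sums to 1 (up to floating-point rounding)
print(sum(wrong.values()))  # 8 / 6, so not a probability distribution
```

With the token count as denominator the values sum to one by construction, since the counts themselves add up to the number of tokens. With the distinct-word count they sum to tokens divided by types, which equals one only if every word occurs exactly once.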