What does "word count" refer to when calculating unigram probabilities in a unigram language model?

Tags:

nlp

I'm using a unigram language model and want to calculate the probability of each unigram. Should I divide the number of occurrences of a unigram by the number of distinct unigrams, or by the total count of all unigram occurrences?


asked Apr 25 '13 22:04

vikifor

1 Answer

Divide by the total number of tokens, i.e. word occurrences, in the training set. The reason is easy to see: the counts of all words add up to the total number of tokens, so dividing each count by that total gives probabilities that sum to one. If you divide by the number of distinct words instead, the probabilities will not necessarily sum to one, so they won't form a probability distribution.
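A minimal sketch of that calculation, assuming whitespace tokenization and a made-up toy sentence (the function name `unigram_probabilities` is only illustrative, not from any library):

```python
from collections import Counter


def unigram_probabilities(tokens):
    """Maximum-likelihood unigram estimate: count(w) / total number of tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())  # all word occurrences, not distinct words
    return {word: count / total for word, count in counts.items()}


# Toy training set (an assumption for illustration): 8 tokens, 6 distinct words.
tokens = "the cat sat on the mat the end".split()
counts = Counter(tokens)

probs = unigram_probabilities(tokens)
print(probs["the"])          # count("the") / total tokens
print(sum(probs.values()))   # sums to 1, so this is a probability distribution

# Dividing by the number of distinct words instead does not normalize:
wrong = {word: count / len(counts) for word, count in counts.items()}
print(sum(wrong.values()))   # total tokens / distinct words, here greater than 1
```

With the wrong denominator the values sum to (total tokens) / (distinct words), which equals one only in the degenerate case where every word occurs exactly once.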


answered Oct 03 '22 05:10

Fred Foo

