
How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?

This is my code:

    import nltk
    from nltk.util import ngrams

    # `raw` is the text; tokenize it into one long sequence of words.
    sequence = nltk.tokenize.word_tokenize(raw)
    bigram = ngrams(sequence, 2)
    freq_dist = nltk.FreqDist(bigram)
    prob_dist = nltk.MLEProbDist(freq_dist)
    number_of_bigrams = freq_dist.N()

However, the above code treats the whole text as one sequence. But the sentences are separate, and I assume the last word of one sentence is unrelated to the first word of the next sentence. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
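I suppose I could count bigrams sentence by sentence and then combine the counts, something like this untested sketch, but I am not sure it is the right way:

    import nltk
    from nltk.util import ngrams

    # Count bigrams within each sentence so that no bigram crosses a sentence boundary.
    freq_dist = nltk.FreqDist(
        bigram
        for sent in nltk.sent_tokenize(raw)
        for bigram in ngrams(nltk.word_tokenize(sent), 2)
    )
    prob_dist = nltk.MLEProbDist(freq_dist)
    number_of_bigrams = freq_dist.N()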

There are similar questions, like What are ngram counts and how to implement using nltk?, but they are mostly about a single sequence of words.

Asked Mar 02 '19 by Ahmad


1 Answer

You can use the new nltk.lm module. Here's an example; first, get some data and tokenize it:

import os
import requests
import io #codecs

from nltk import word_tokenize, sent_tokenize 

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
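
For a quick sanity check, you could inspect the structure of tokenized_text (the exact tokens depend on the downloaded text):

print(len(tokenized_text))     # number of sentences
print(tokenized_text[0][:10])  # first ten lowercased tokens of the first sentence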

Then the language modelling:

# Preprocess the tokenized text for 3-gram language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n) # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
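
To check that the model was actually fit, you could look at its vocabulary (the exact size depends on the text; the <s>/</s> padding symbols and the <UNK> label are counted too):

print(len(model.vocab))                                      # vocabulary size
print(model.vocab.lookup('language is never quuxx'.split())) # unknown words map to '<UNK>'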

To get the counts:

model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')

To get the probabilities:

model.score('is', 'language'.split())  # P('is'|'language')
model.score('never', 'language is'.split())  # P('never'|'language is')
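
This also gives you the quantities from the question. As a rough sketch (assuming the model was fit as above, so the <s>/</s> padding symbols are included in the counts), model.counts[2] behaves like a conditional frequency distribution over bigrams:

bigram_counts = model.counts[2]         # counts of the second word given the first
number_of_bigrams = bigram_counts.N()   # total number of bigram tokens seen in training
print(number_of_bigrams)                # includes bigrams with the padding symbols
print(model.score('is', ['language']))  # the same MLE estimate of P('is'|'language') as above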

There are some kinks on the Kaggle platform when loading the notebook, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk

Answered Nov 14 '22 by alvas