
How to get the probability of bigrams in a text of sentences?

I have a text which has many sentences. How can I use nltk.ngrams to process it?

This is my code:

    import nltk
    from nltk.util import ngrams

    # `raw` is the text; tokenize it into one long sequence of words.
    sequence = nltk.tokenize.word_tokenize(raw)
    bigram = ngrams(sequence, 2)
    freq_dist = nltk.FreqDist(bigram)
    prob_dist = nltk.MLEProbDist(freq_dist)
    number_of_bigrams = freq_dist.N()

However, the above code treats the whole text as one sequence. But the sentences are separate, and I assume the last word of one sentence is unrelated to the first word of the next sentence. How can I create bigrams for such a text? I also need prob_dist and number_of_bigrams, which are based on freq_dist.
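I suppose I could count bigrams sentence by sentence and then combine the counts, something like this untested sketch, but I am not sure it is the right way:

    import nltk
    from nltk.util import ngrams

    # Count bigrams within each sentence so that no bigram crosses a sentence boundary.
    freq_dist = nltk.FreqDist(
        bigram
        for sent in nltk.sent_tokenize(raw)
        for bigram in ngrams(nltk.word_tokenize(sent), 2)
    )
    prob_dist = nltk.MLEProbDist(freq_dist)
    number_of_bigrams = freq_dist.N()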

There are similar questions, like What are ngram counts and how to implement using nltk?, but they are mostly about a single sequence of words.

Asked Mar 02 '19 by Ahmad


1 Answer

You can use the new nltk.lm module. Here's an example; first, get some data and tokenize it:

import os
import requests
import io #codecs

from nltk import word_tokenize, sent_tokenize 

# Text version of https://kilgarriff.co.uk/Publications/2005-K-lineer.pdf
if os.path.isfile('language-never-random.txt'):
    with io.open('language-never-random.txt', encoding='utf8') as fin:
        text = fin.read()
else:
    url = "https://gist.githubusercontent.com/alvations/53b01e4076573fea47c6057120bb017a/raw/b01ff96a5f76848450e648f35da6497ca9454e4a/language-never-random.txt"
    text = requests.get(url).content.decode('utf8')
    with io.open('language-never-random.txt', 'w', encoding='utf8') as fout:
        fout.write(text)

# Tokenize the text.
tokenized_text = [list(map(str.lower, word_tokenize(sent)))
                  for sent in sent_tokenize(text)]
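
For a quick sanity check, you could inspect the structure of tokenized_text (the exact tokens depend on the downloaded text):

print(len(tokenized_text))     # number of sentences
print(tokenized_text[0][:10])  # first ten lowercased tokens of the first sentence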

Then the language modelling:

# Preprocess the tokenized text for 3-gram language modelling
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm import MLE

n = 3
train_data, padded_sents = padded_everygram_pipeline(n, tokenized_text)

model = MLE(n) # Let's train a 3-gram maximum likelihood estimation model.
model.fit(train_data, padded_sents)
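
To check that the model was actually fit, you could look at its vocabulary (the exact size depends on the text; the <s>/</s> padding symbols and the <UNK> label are counted too):

print(len(model.vocab))                                      # vocabulary size
print(model.vocab.lookup('language is never quuxx'.split())) # unknown words map to '<UNK>'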

To get the counts:

model.counts['language'] # i.e. Count('language')
model.counts[['language']]['is'] # i.e. Count('is'|'language')
model.counts[['language', 'is']]['never'] # i.e. Count('never'|'language is')

To get the probabilities:

model.score('is', 'language'.split())  # P('is'|'language')
model.score('never', 'language is'.split())  # P('never'|'language is')
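
This also gives you the quantities from the question. As a rough sketch (assuming the model was fit as above, so the <s>/</s> padding symbols are included in the counts), model.counts[2] behaves like a conditional frequency distribution over bigrams:

bigram_counts = model.counts[2]         # counts of the second word given the first
number_of_bigrams = bigram_counts.N()   # total number of bigram tokens seen in training
print(number_of_bigrams)                # includes bigrams with the padding symbols
print(model.score('is', ['language']))  # the same MLE estimate of P('is'|'language') as above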

There are some kinks on the Kaggle platform when loading the notebook, but this notebook should give a good overview of the nltk.lm module: https://www.kaggle.com/alvations/n-gram-language-model-with-nltk

Answered Nov 14 '22 by alvas