I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.
e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped "
For the above file, the bigram set and their counts will be:
(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1
I have come across an example of Counter objects in Python, which is used to count unigrams (single words). It also uses a regex approach.
The example goes like this:
>>> # Find the most common words in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1), ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]
I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than Counter objects or regex would also be appreciated.
First, we need to generate such word pairs from the existing sentence while maintaining their order. Such pairs are called bigrams. Python's NLTK library has a bigrams function which helps us generate these pairs.
A bigram frequency measures how often a pair of words occurs. For instance, in the example sentence above, ('the', 'quick') accounts for 2 of the 12 bigrams.
In short, a unigram means taking one word at a time, a bigram means taking two words at a time, and a trigram means taking three words at a time.
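As a minimal sketch of that NLTK-based route (assuming the nltk package is installed, and reusing the same regex tokenization as the other snippets in this thread):

import re
from collections import Counter
import nltk  # assumed to be installed; nltk.bigrams yields adjacent word pairs

# Tokenize the file with the same \w+ regex used elsewhere in this thread
words = re.findall(r'\w+', open('a.txt').read())

# Count each (word, word) pair produced by nltk.bigrams
bigram_counts = Counter(nltk.bigrams(words))
print(bigram_counts.most_common(10))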
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+", "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, ('realize', 'his'): 1})
Bonus
Get the frequency of any n-gram:
from itertools import tee, islice

def ngrams(lst, n):
    # Works on any iterable: tee() gives two independent copies of the stream,
    # so we can read an n-word window from one copy and keep the other.
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)  # slide the window forward by one word
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
This works with lazy iterables and generators too, so you can write a generator which reads a file line by line, yielding words, and pass it to ngrams to consume lazily without reading the whole file into memory.
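A minimal sketch of that idea, assuming the input file is a.txt (the words_from_file name is just illustrative); it streams words into the ngrams generator defined above:

import re
from collections import Counter

def words_from_file(path):
    # Hypothetical helper: lazily yield words line by line instead of
    # reading the whole file into memory at once.
    with open(path) as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                yield word

# Bigrams that span line breaks are still counted, since the word stream is continuous
bigram_counts = Counter(ngrams(words_from_file('a.txt'), 2))
print(bigram_counts.most_common(10))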
How about zip()?
import re
from collections import Counter

words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words, words[1:])))
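One note on this version: words[1:] builds a second full copy of the word list, and zip() simply stops at the shorter of the two sequences. On Python 3.10+ the same pairing can be written without the extra copy via itertools.pairwise; a minimal sketch, assuming the same a.txt:

import re
from collections import Counter
from itertools import pairwise  # available in Python 3.10+

words = re.findall(r'\w+', open('a.txt').read())

# pairwise() lazily yields adjacent (word, word) tuples without copying the list
print(Counter(pairwise(words)))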