I want to count the number of occurrences of all bigrams (pairs of adjacent words) in a file using Python. I am dealing with very large files, so I am looking for an efficient way. I tried using the count method with the regex "\w+\s\w+" on the file contents, but it did not prove to be efficient.
e.g. Let's say I want to count the number of bigrams from a file a.txt, which has the following content:
"the quick person did not realize his speed and the quick person bumped "
For the above file, the bigram set and their counts will be:
(the, quick) = 2
(quick, person) = 2
(person, did) = 1
(did, not) = 1
(not, realize) = 1
(realize, his) = 1
(his, speed) = 1
(speed, and) = 1
(and, the) = 1
(person, bumped) = 1
I have come across an example of Counter objects in Python, which is used to count unigrams (single words). It also uses a regex approach.
The example goes like this:
>>> # Find the most common words in a.txt
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('a.txt').read())
>>> print(Counter(words).most_common())
The output of the above code is:
[('the', 2), ('quick', 2), ('person', 2), ('did', 1), ('not', 1), ('realize', 1), ('his', 1), ('speed', 1), ('and', 1), ('bumped', 1)]
I was wondering if it is possible to use the Counter object to get counts of bigrams. Any approach other than Counter objects or regex would also be appreciated.
First, we need to generate such word pairs from the existing sentence while maintaining their order. Such pairs are called bigrams. Python's NLTK library has a bigrams function which helps us generate these pairs.
A bigram frequency measures how often a pair of words occurs. For instance, in the example sentence above, ('the', 'quick') accounts for 2 of the 12 bigrams.
In short, a unigram means taking one word at a time, a bigram means taking two words at a time, and a trigram means taking three words at a time.
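As a minimal sketch of that NLTK-based route (assuming the nltk package is installed, and reusing the same regex tokenization as the other snippets in this thread):

import re
from collections import Counter
import nltk  # assumed to be installed; nltk.bigrams yields adjacent word pairs

# Tokenize the file with the same \w+ regex used elsewhere in this thread
words = re.findall(r'\w+', open('a.txt').read())

# Count each (word, word) pair produced by nltk.bigrams
bigram_counts = Counter(nltk.bigrams(words))
print(bigram_counts.most_common(10))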
Some itertools magic:
>>> import re
>>> from collections import Counter
>>> from itertools import islice
>>> words = re.findall(r"\w+", "the quick person did not realize his speed and the quick person bumped")
>>> print(Counter(zip(words, islice(words, 1, None))))
Output:
Counter({('the', 'quick'): 2, ('quick', 'person'): 2, ('person', 'did'): 1, ('did', 'not'): 1, ('not', 'realize'): 1, ('and', 'the'): 1, ('speed', 'and'): 1, ('person', 'bumped'): 1, ('his', 'speed'): 1, ('realize', 'his'): 1})
Bonus
Get the frequency of any n-gram:
from itertools import tee, islice

def ngrams(lst, n):
    # Works on any iterable: tee() gives two independent copies of the stream,
    # so we can read an n-word window from one copy and keep the other.
    tlst = lst
    while True:
        a, b = tee(tlst)
        l = tuple(islice(a, n))
        if len(l) == n:
            yield l
            next(b)  # slide the window forward by one word
            tlst = b
        else:
            break

>>> Counter(ngrams(words, 3))
Output:
Counter({('the', 'quick', 'person'): 2, ('and', 'the', 'quick'): 1, ('realize', 'his', 'speed'): 1, ('his', 'speed', 'and'): 1, ('person', 'did', 'not'): 1, ('quick', 'person', 'did'): 1, ('quick', 'person', 'bumped'): 1, ('did', 'not', 'realize'): 1, ('speed', 'and', 'the'): 1, ('not', 'realize', 'his'): 1})
This works with lazy iterables and generators too, so you can write a generator which reads a file line by line, yielding words, and pass it to ngrams to consume lazily without reading the whole file into memory.
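A minimal sketch of that idea, assuming the input file is a.txt (the words_from_file name is just illustrative); it streams words into the ngrams generator defined above:

import re
from collections import Counter

def words_from_file(path):
    # Hypothetical helper: lazily yield words line by line instead of
    # reading the whole file into memory at once.
    with open(path) as f:
        for line in f:
            for word in re.findall(r'\w+', line):
                yield word

# Bigrams that span line breaks are still counted, since the word stream is continuous
bigram_counts = Counter(ngrams(words_from_file('a.txt'), 2))
print(bigram_counts.most_common(10))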
How about zip()?
import re
from collections import Counter

words = re.findall(r'\w+', open('a.txt').read())
print(Counter(zip(words, words[1:])))
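One note on this version: words[1:] builds a second full copy of the word list, and zip() simply stops at the shorter of the two sequences. On Python 3.10+ the same pairing can be written without the extra copy via itertools.pairwise; a minimal sketch, assuming the same a.txt:

import re
from collections import Counter
from itertools import pairwise  # available in Python 3.10+

words = re.findall(r'\w+', open('a.txt').read())

# pairwise() lazily yields adjacent (word, word) tuples without copying the list
print(Counter(pairwise(words)))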