
Counting the frequency of three-word phrases

I have the code below, which finds the frequencies of two-word phrases. I need to do the same for three-word phrases, but this code does not seem to work for them.

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
K Bh asked Sep 21 '25 05:09


2 Answers

You can use collections.Counter on an iterable of 3-word groupings. The groupings are built with a generator expression that slices the list of words.

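from collections import Counter

# `words` is the list of tokens built in the question
# slide a window of length 3 over the list
three_words = (words[i:i+3] for i in range(len(words)-2))
# lists are unhashable, so convert each window to a tuple before counting
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}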

print(wordscount)

{'show makes me': 2}

Notice that we don't call str.join until the very end, which avoids building strings for phrases that get filtered out. In addition, the tuple conversion is required because Counter stores its items as dict keys, which must be hashable, and list slices are not.
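For comparison, the zip approach from the question can be extended to three words by adding a third staggered slice. The sketch below is only an illustration of that idea, not part of the original answer; the names trigram_counts and repeated are made up for the example. Since zip already yields tuples, no separate tuple conversion is needed.

```python
from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

# zip over three staggered slices yields 3-tuples, which are hashable
trigram_counts = Counter(zip(words, words[1:], words[2:]))

# join into strings only for the trigrams that occur more than once
repeated = {' '.join(t): f for t, f in trigram_counts.items() if f > 1}
print(repeated)  # {'show makes me': 2}
```

The slices copy the list, so for very long inputs the index-based generator above or an iterator-based helper may be preferable.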

jpp answered Sep 22 '25 21:09


I suggest factoring the functionality out into a separate function:

def nwise(iterable, n):
    """
    Iterate over n-grams of an iterable.
    Has a bit of an overhead compared to pairwise (although only during
    initialization), so the two functions are implemented independently.
    """
    iterables = [iter(iterable) for _ in range(n)]
    for index, it in enumerate(iterables):
        for _ in range(index):
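            # the default avoids a StopIteration (a RuntimeError inside a
            # generator) when the input has fewer than n items
            next(it, None)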
    yield from zip(*iterables)

Then you can do

two_words = [" ".join(bigram) for bigram in nwise(words, 2)]

and

three_words = [" ".join(trigram) for trigram in nwise(words, 3)]

and so on. You can then use collections.Counter on top of that:

three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))
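To tie this back to the question, here is a self-contained sketch that counts n-grams and keeps only the repeated ones. It is an assumption-laden variant, not the code above: nwise_tee and repeated_ngrams are hypothetical names, and the helper uses itertools.tee and itertools.islice instead of separate iter() calls so that it also works when the input is a one-shot iterator.

```python
from collections import Counter
from itertools import islice, tee
import re

def nwise_tee(iterable, n):
    """Yield n-grams as tuples (hypothetical variant of nwise built on tee)."""
    iterators = tee(iterable, n)
    # advance the i-th copy by i items so the copies are staggered
    staggered = [islice(it, i, None) for i, it in enumerate(iterators)]
    return zip(*staggered)

def repeated_ngrams(words, n, min_count=2):
    """Map each joined n-gram to its frequency, keeping those seen at least min_count times."""
    counts = Counter(" ".join(gram) for gram in nwise_tee(words, n))
    return {gram: freq for gram, freq in counts.items() if freq >= min_count}

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r"\w+", sentence)

print(repeated_ngrams(words, 2))  # {'I love': 2, 'show makes': 2, 'makes me': 2}
print(repeated_ngrams(words, 3))  # {'show makes me': 2}
```

Passing n as a parameter means the same call covers two-word, three-word, and longer phrases without further changes.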
L3viathan answered Sep 22 '25 21:09