Efficient Python for word pair co-occurrence counting?

Question

I would like an efficient Pythonic way to count neighbouring word pairs in text. Efficient because it needs to wok well with larger datasets.

The way the count is done is important too.

Consider this simplified example:

words_list = "apple banana banana apple".split()

I can create neighbouring pairs using:

word_pair_list = zip(words_list[:-1], words_list[1:])

I can then count them Pythonically using

word_pair_ctr = collections.Counter(word_pair_list)

This gives me

(('apple', 'banana'), 1)
(('banana', 'banana'), 1)
(('banana', 'apple'), 1)

Note that 'apple' and 'apple' are not a neighbouring pair.

But I want the order of the pairs not to count. That means the ('apple', 'banana') and ('banana', 'apple') should be considered the same, and the counts should be

(('apple', 'banana'), 2)
(('banana', 'banana'), 1)

I can't find a Pythonic way of doing this that doesn't need me to visit each item in the word list, which becomes inefficient for larger text.

I'm happy to use the common scipy, numpy and pandas as libraries too.

Julien Briand · Accepted Answer

You could use a modified version of pairwise function from the official documentation (https://docs.python.org/3.8/library/itertools.html) in order to read your list by pair and, at the same time, reorder the members of each pair:

l = "apple banana banana apple".split()
def pairwise(iterable):
    """s -> (s0,s1), (s1,s2), (s2, s3), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return ((a, b) if a < b else (b, a) for a, b in zip(a, b))
>>> list(pairwise(l))
<class 'list'>: ['apple', 'banana', 'banana', 'apple']
>>> collections.Counter(pairwise(l))
Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

Hope this helps!

L3viathan · Answer

There are several solutions with builtins.

Mapping the `word_pair_list` to `frozenset`:

word_pair_ctr = collections.Counter(map(frozenset, word_pair_list))

The result:

Counter({frozenset({'apple', 'banana'}): 2, frozenset({'banana'}): 1})

The second set might look weird, but that's only because sets contain only ever one identical element. Retrieval will still work, i.e. word_pair_ctr[frozenset(["banana", "banana"])] equals 1.

You need to use frozenset as opposed to a normal set, because normal sets aren't hashable, and can therefore not be keys in a dictionary (or Counter).

Sort the pairs before insertion into the Counter:

word_pair_ctr = collections.Counter(map(lambda x: tuple(sorted(x)), word_pair_list))

The result looks like this:

Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

While this might look nicer, you'll have to make sure that you access the counts in the same way, i.e. word_pair_ctr[tuple(sorted([word1, word2]))], which might feel even more convoluted than the previous solution.

Subclass Counter

The third option is to make your own counter class that does all this for you.

class BiDirectionalCounter(collections.Counter):
    def __init__(self, iterable):
        super().__init__(map(lambda x: tuple(sorted(x)), iterable))
    def __getitem__(self, items):
        return super().__getitem__(tuple(sorted(items)))

This seemingly works:

>>> BidirectionalCounter(word_pair_list)
BidirectionalCounter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

But to truly work, you'd need to implement all of the relevant dunder methods, i.e. __setitem__, __add__, __iadd__, ...

Efficient Python for word pair co-occurrence counting?

Tags:

python

pandas

natural-language-processing

Intuitive Text Mining

2 Answers

Julien Briand

Mapping the `word_pair_list` to `frozenset`:

Sort the pairs before insertion into the Counter:

Subclass Counter

L3viathan

Recent Activity

Donate For Us

Efficient Python for word pair co-occurrence counting?

Tags:

python

pandas

natural-language-processing

Intuitive Text Mining

2 Answers

Julien Briand

Mapping the word_pair_list to frozenset:

Sort the pairs before insertion into the Counter:

Subclass Counter

L3viathan

Related questions

Recent Activity

Donate For Us

Mapping the `word_pair_list` to `frozenset`: