Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Efficient Python for word pair co-occurrence counting?

I would like an efficient Pythonic way to count neighbouring word pairs in text. Efficient because it needs to wok well with larger datasets.

The way the count is done is important too.

Consider this simplified example:

words_list = "apple banana banana apple".split()

I can create neighbouring pairs using:

word_pair_list = zip(words_list[:-1], words_list[1:])

I can then count them Pythonically using

word_pair_ctr = collections.Counter(word_pair_list)

This gives me

(('apple', 'banana'), 1)
(('banana', 'banana'), 1)
(('banana', 'apple'), 1)

Note that 'apple' and 'apple' are not a neighbouring pair.

But I want the order of the pairs not to count. That means the ('apple', 'banana') and ('banana', 'apple') should be considered the same, and the counts should be

(('apple', 'banana'), 2)
(('banana', 'banana'), 1)

I can't find a Pythonic way of doing this that doesn't need me to visit each item in the word list, which becomes inefficient for larger text.

I'm happy to use the common scipy, numpy and pandas as libraries too.

like image 286
Intuitive Text Mining Avatar asked Jan 22 '19 13:01

Intuitive Text Mining


2 Answers

You could use a modified version of pairwise function from the official documentation (https://docs.python.org/3.8/library/itertools.html) in order to read your list by pair and, at the same time, reorder the members of each pair:

l = "apple banana banana apple".split()
def pairwise(iterable):
    """s -> (s0,s1), (s1,s2), (s2, s3), ..."""
    a, b = itertools.tee(iterable)
    next(b, None)
    return ((a, b) if a < b else (b, a) for a, b in zip(a, b))
>>> list(pairwise(l))
<class 'list'>: ['apple', 'banana', 'banana', 'apple']
>>> collections.Counter(pairwise(l))
Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

Hope this helps!

like image 88
Julien Briand Avatar answered Nov 07 '22 02:11

Julien Briand


There are several solutions with builtins.

Mapping the word_pair_list to frozenset:

word_pair_ctr = collections.Counter(map(frozenset, word_pair_list))

The result:

Counter({frozenset({'apple', 'banana'}): 2, frozenset({'banana'}): 1})

The second set might look weird, but that's only because sets contain only ever one identical element. Retrieval will still work, i.e. word_pair_ctr[frozenset(["banana", "banana"])] equals 1.

You need to use frozenset as opposed to a normal set, because normal sets aren't hashable, and can therefore not be keys in a dictionary (or Counter).

Sort the pairs before insertion into the Counter:

word_pair_ctr = collections.Counter(map(lambda x: tuple(sorted(x)), word_pair_list))

The result looks like this:

Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

While this might look nicer, you'll have to make sure that you access the counts in the same way, i.e. word_pair_ctr[tuple(sorted([word1, word2]))], which might feel even more convoluted than the previous solution.

Subclass Counter

The third option is to make your own counter class that does all this for you.

class BiDirectionalCounter(collections.Counter):
    def __init__(self, iterable):
        super().__init__(map(lambda x: tuple(sorted(x)), iterable))
    def __getitem__(self, items):
        return super().__getitem__(tuple(sorted(items)))

This seemingly works:

>>> BidirectionalCounter(word_pair_list)
BidirectionalCounter({('apple', 'banana'): 2, ('banana', 'banana'): 1})

But to truly work, you'd need to implement all of the relevant dunder methods, i.e. __setitem__, __add__, __iadd__, ...

like image 2
L3viathan Avatar answered Nov 07 '22 02:11

L3viathan