I would like an efficient, Pythonic way to count neighbouring word pairs in text. Efficiency matters because it needs to work well with larger datasets.
The way the count is done is important too.
Consider this simplified example:
words_list = "apple banana banana apple".split()
I can create neighbouring pairs using:
word_pair_list = zip(words_list[:-1], words_list[1:])
I can then count them Pythonically using
word_pair_ctr = collections.Counter(word_pair_list)
This gives me
(('apple', 'banana'), 1)
(('banana', 'banana'), 1)
(('banana', 'apple'), 1)
Note that 'apple' and 'apple' are not a neighbouring pair.
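Putting the pieces above together into a single runnable snippet (using only the names already defined in the question):

import collections

words_list = "apple banana banana apple".split()
# Adjacent pairs, still order-sensitive at this point.
word_pair_list = zip(words_list[:-1], words_list[1:])
word_pair_ctr = collections.Counter(word_pair_list)
print(word_pair_ctr)
# Counter({('apple', 'banana'): 1, ('banana', 'banana'): 1, ('banana', 'apple'): 1})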
But I want the order within each pair not to matter. That means ('apple', 'banana') and ('banana', 'apple') should be considered the same, and the counts should be
(('apple', 'banana'), 2)
(('banana', 'banana'), 1)
I can't find a Pythonic way of doing this that doesn't require visiting each item in the word list explicitly, which becomes inefficient for larger texts.
I'm happy to use common libraries such as scipy, numpy and pandas too.
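For scale, a vectorized sketch along those lines (this assumes a numpy string array and pandas' DataFrame.value_counts, available from pandas 1.1 onwards) might look like:

import numpy as np
import pandas as pd

words = np.array("apple banana banana apple".split())
# Stack adjacent words into an (n-1, 2) array, then sort each row so that
# ('banana', 'apple') and ('apple', 'banana') collapse to the same key.
pairs = np.sort(np.column_stack([words[:-1], words[1:]]), axis=1)
pair_counts = pd.DataFrame(pairs, columns=["first", "second"]).value_counts()
print(pair_counts)
# ('apple', 'banana') -> 2, ('banana', 'banana') -> 1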
You could use a modified version of the pairwise recipe from the official itertools documentation (https://docs.python.org/3.8/library/itertools.html) to read your list pair by pair and, at the same time, order the members of each pair:
l = "apple banana banana apple".split()
def pairwise(iterable):
"""s -> (s0,s1), (s1,s2), (s2, s3), ..."""
a, b = itertools.tee(iterable)
next(b, None)
return ((a, b) if a < b else (b, a) for a, b in zip(a, b))
>>> list(pairwise(l))
[('apple', 'banana'), ('banana', 'banana'), ('apple', 'banana')]
>>> collections.Counter(pairwise(l))
Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})
Hope this helps!
There are several solutions with builtins.
The first is to map each pair in word_pair_list to a frozenset:

word_pair_ctr = collections.Counter(map(frozenset, word_pair_list))
The result:
Counter({frozenset({'apple', 'banana'}): 2, frozenset({'banana'}): 1})
The second set might look weird, but that's only because a set contains each distinct element only once. Retrieval will still work, i.e. word_pair_ctr[frozenset(["banana", "banana"])] equals 1.
You need to use frozenset as opposed to a normal set, because normal sets aren't hashable, and can therefore not be keys in a dictionary (or Counter).
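A quick illustration of that hashability point:

>>> collections.Counter([{"apple", "banana"}])
Traceback (most recent call last):
  ...
TypeError: unhashable type: 'set'
>>> collections.Counter([frozenset({"apple", "banana"})])
Counter({frozenset({'apple', 'banana'}): 1})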
The second option is to sort each pair into a tuple:

word_pair_ctr = collections.Counter(map(lambda x: tuple(sorted(x)), word_pair_list))
The result looks like this:
Counter({('apple', 'banana'): 2, ('banana', 'banana'): 1})
While this might look nicer, you'll have to make sure that you access the counts in the same way, i.e. word_pair_ctr[tuple(sorted([word1, word2]))], which might feel even more convoluted than the previous solution.
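If that lookup feels clunky, a small helper (a hypothetical name, just to hide the normalisation) keeps call sites readable:

def pair_count(counter, word1, word2):
    """Look up an unordered pair in a Counter keyed by sorted tuples."""
    return counter[tuple(sorted((word1, word2)))]

>>> pair_count(word_pair_ctr, "banana", "apple")
2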
The third option is to make your own counter class that does all this for you.
class BiDirectionalCounter(collections.Counter):
    def __init__(self, iterable):
        super().__init__(map(lambda x: tuple(sorted(x)), iterable))

    def __getitem__(self, items):
        return super().__getitem__(tuple(sorted(items)))
This seemingly works:
>>> BiDirectionalCounter(word_pair_list)
BiDirectionalCounter({('apple', 'banana'): 2, ('banana', 'banana'): 1})
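Because __getitem__ sorts the key, lookups work in either order (bctr here is just a name for the instance above):

>>> bctr = BiDirectionalCounter(word_pair_list)
>>> bctr["banana", "apple"]
2
>>> bctr["apple", "banana"]
2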
But to truly work, you'd need to implement all of the relevant dunder methods, i.e. __setitem__, __add__, __iadd__, ...
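A sketch of what a couple of those overrides might look like (an illustration of the idea, not a complete implementation; methods like update() and __add__ would still need the same treatment):

class BiDirectionalCounter(collections.Counter):
    def __init__(self, iterable=()):
        super().__init__(tuple(sorted(pair)) for pair in iterable)

    @staticmethod
    def _key(pair):
        # Normalise a pair so that order never matters.
        return tuple(sorted(pair))

    def __getitem__(self, pair):
        return super().__getitem__(self._key(pair))

    def __setitem__(self, pair, count):
        super().__setitem__(self._key(pair), count)

    def __contains__(self, pair):
        return super().__contains__(self._key(pair))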