I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).
Example:
import pandas as pd

sentences = ['commercial van for movers', 'partial van for movers', 'commercial van for moving']
sents = pd.Series(sentences)
The solution in this answer mostly works, but it matches partial words and returns the following:
'ial van for mov'
The output should be
'van for'
I couldn't find a way to modify it to return the desired output.
The key is to modify the search so it compares whole-word subsequences instead of character substrings.
from itertools import islice

def is_sublist(source, target):
    # True if the word list `source` occurs contiguously inside `target`.
    slen = len(source)
    return any(all(item1 == item2
                   for (item1, item2) in zip(source, islice(target, i, i + slen)))
               for i in range(len(target) - slen + 1))

def long_substr_by_word(data):
    # Longest contiguous run of whole words shared by every string in `data`.
    subseq = []
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        # Try every contiguous word slice of the first sentence, keeping
        # the longest one that appears in all of the others.
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0]) - i + 1):
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)
Demo:
>>> data = ['commercial van for movers',
... 'partial van for movers',
... 'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
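If the nested `is_sublist` scans become slow on longer inputs, the same whole-word idea can be expressed with set intersections of word n-grams. This is a sketch of an alternative, not part of the original answer; the function name `long_substr_by_word_alt` is my own. Note that if several common runs share the maximum length, an arbitrary one of them is returned.

```python
def ngrams(words, n):
    """All contiguous n-word tuples in the word list `words`."""
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def long_substr_by_word_alt(data):
    """Longest run of whole words common to all strings, via n-gram sets."""
    if not data:
        return ''
    seqs = [s.split() for s in data]
    # Try the longest possible run first and shrink until a common one exists.
    for n in range(len(seqs[0]), 0, -1):
        common = ngrams(seqs[0], n)
        for words in seqs[1:]:
            common &= ngrams(words, n)
        if common:
            return ' '.join(next(iter(common)))
    return ''
```

On the examples above it agrees with `long_substr_by_word`, returning 'van for' and 'bx bx' respectively.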