
Longest common sequence of words from more than two strings

Tags: python, string

I am trying to find the longest common sequence of words in a list of sentences (more than two sentences).

Example:

import pandas as pd
sentences = ['commercial van for movers', 'partial van for movers', 'commercial van for moving']  # renamed from `list` to avoid shadowing the built-in
sents = pd.Series(sentences)

The solution in this answer works, but it matches character by character, so it captures partial words and returns the following:

'ial van for mov'

The output should be

'van for'

I couldn't find a way to modify it to return the desired output.
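
For reference, the partial-word behaviour can be reproduced with a minimal character-level sketch. This is not the code from the linked answer; the function name `longest_common_substring` is made up here for illustration, and it simply brute-forces substrings of the first string:

```python
def longest_common_substring(strings):
    """Longest run of characters that occurs in every string (brute force)."""
    first = strings[0]
    best = ""
    for i in range(len(first)):
        # Only try candidates strictly longer than the best found so far.
        for j in range(i + len(best) + 1, len(first) + 1):
            candidate = first[i:j]
            if all(candidate in s for s in strings):
                best = candidate
            else:
                # Extending a failing candidate cannot make it match.
                break
    return best


sentences = ['commercial van for movers',
             'partial van for movers',
             'commercial van for moving']
print(longest_common_substring(sentences))  # 'ial van for mov'
```

Because the comparison works on characters, nothing stops a match from starting or ending in the middle of a word.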

mallet asked Oct 18 '25 11:10



1 Answer

The key is to modify that solution so it compares sequences of whole words instead of sequences of characters.

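from itertools import islice

def is_sublist(source, target):
    """Return True if `source` occurs as a contiguous run inside `target`."""
    slen = len(source)
    # Slide a window of len(source) over target and compare element by element.
    return any(all(item1 == item2 for (item1, item2) in zip(source, islice(target, i, i+slen))) for i in range(len(target) - slen + 1))

def long_substr_by_word(data):
    """Return the longest run of whole words shared by every string in `data`."""
    subseq = []
    data_seqs = [s.split(' ') for s in data]
    if len(data_seqs) > 1 and len(data_seqs[0]) > 0:
        # Try every word window (start i, length j) of the first sentence.
        for i in range(len(data_seqs[0])):
            for j in range(len(data_seqs[0])-i+1):
                # Keep the window only if it beats the best so far
                # and appears in every sentence.
                if j > len(subseq) and all(is_sublist(data_seqs[0][i:i+j], x) for x in data_seqs):
                    subseq = data_seqs[0][i:i+j]
    return ' '.join(subseq)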

Demo:

>>> data = ['commercial van for movers',
...         'partial van for movers',
...         'commercial van for moving']
>>> long_substr_by_word(data)
'van for'
>>>
>>> data = ['a bx bx z', 'c bx bx zz']
>>> long_substr_by_word(data)
'bx bx'
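
If the input is large, one possible variation is to precompute the word n-grams of each sentence as sets and search from the longest window down. The sketch below is only an illustration of that idea, not part of the answer above; `longest_common_words` is a hypothetical name, and it assumes `str.split()` with no argument is acceptable (so repeated whitespace is collapsed, unlike `split(' ')`):

```python
def longest_common_words(sentences):
    """Longest run of whole words shared by all sentences ('' if none)."""
    token_lists = [s.split() for s in sentences]  # assumption: collapse whitespace
    if not token_lists:
        return ""
    first, rest = token_lists[0], token_lists[1:]
    shortest = min(len(tokens) for tokens in token_lists)
    # Try the longest possible window first, so the first hit is the answer.
    for n in range(shortest, 0, -1):
        # Set of n-word windows for every sentence other than the first.
        rest_ngrams = [
            {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
            for tokens in rest
        ]
        for i in range(len(first) - n + 1):
            window = tuple(first[i:i + n])
            if all(window in grams for grams in rest_ngrams):
                return " ".join(window)
    return ""


print(longest_common_words(['commercial van for movers',
                            'partial van for movers',
                            'commercial van for moving']))  # 'van for'
```

Note one behavioural difference: given a single sentence, this sketch returns that sentence unchanged, whereas `long_substr_by_word` returns an empty string.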
Steven Rumbalski answered Oct 21 '25 02:10
