Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

list of maximal common connected substrings of two strings

Tags:

algorithm

Suppose we have two strings "abcdefgh" and "abudesh". I want for solution to be a list ["ab", "de", "h"]. So, I want a list of maximally connected substrings that are the same for both strings. Is this has a name and what would be a good approach in solving it?

Edit: I need to say that the order is not important in the way that if we have, for example, two strings "abcdefg" and "defkabc", the result is ["abc", "def"].

like image 836
Alem Avatar asked Dec 07 '25 02:12

Alem


1 Answers

Using:

  • Biopython's pairwise2 to align the two sequences;
  • itertools.groupby to group the "maximally connected substrings".
from Bio import pairwise2
from itertools import groupby

def maxConnectedSubstrings(strA, strB):
    alignment = pairwise2.align.globalxx(strA, strB)[0]
    grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])
    return [''.join(ca for ca,cb in g) for k,g in grouped if k]

print( maxConnectedSubstrings('abcdefgh', 'abudesh') )
# ['ab', 'de', 'h']

Explanation

First, we align the sequences. The result of alignment = pairwise2.align.globalxx(strA, strB)[0] is:

alignment.seqA = 'abcdefgh'
alignment.seqB = 'abude-sh'

The alignment algorithm found the best way to add '-' in the sequences to align them.

Then, we use groupby on zip(alignment.seqA, alignment.seqB). The zip(...) is a sequence of pairs (character from seqA, character from seqB). We group these pairs with the key lambda p: p[0] == p[1], which gives the following result:

grouped = groupby(zip(alignment.seqA, alignment.seqB), key=lambda p: p[0] == p[1])

grouped = [
    (True,  [('a', 'a'),
             ('b', 'b')]),
    (False, [('c', 'u')]),
    (True,  [('d', 'd'),
             ('e', 'e')]),
    (False, [('f', '-'),
             ('g', 's')]),
    (True,  [('h', 'h')])
]

Finally, we discard the False groups, and we join the letters of every True group.

like image 59
Stef Avatar answered Dec 08 '25 15:12

Stef