Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How can I merge overlapping strings in python?

I have some strings,

['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']

These strings partially overlap each other. If you manually overlapped them you would get:

SGALWDVPSPV

I want a way to go from the list of overlapping strings to the final compressed string in python. I feel like this must be a problem that someone has solved already and am trying to avoid reinventing the wheel. The methods I can imagine now are either brute force or involve getting more complicated by using biopython and sequence aligners than I would like. I have some simple short strings and just want to properly merge them in a simple way.

Does anyone have any advice on a nice way to do this in python? Thanks!

like image 288
Adam Price Avatar asked Jan 30 '23 05:01

Adam Price


2 Answers

Here is a quick sorting solution:

s = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
new_s = sorted(s, key=lambda x:s[0].index(x[0]))
a = new_s[0]
b = new_s[-1]
final_s = a[:a.index(b[0])]+b

Output:

'SGALWDVPSPV'

This program sorts s by the value of the index of the first character of each element, in an attempt to find the string that will maximize the overlap distance between the first element and the desired output.

like image 171
Ajax1234 Avatar answered Jan 31 '23 20:01

Ajax1234


My proposed solution with a more challenging test list:

#strFrag = ['SGALWDV', 'GALWDVP', 'ALWDVPS', 'LWDVPSP', 'WDVPSPV']
strFrag = ['ALWDVPS', 'SGALWDV', 'LWDVPSP', 'WDVPSPV', 'GALWDVP', 'LWDVPSP', 'ALWDVPS']

for repeat in range(0, len(strFrag)-1):
    bestMatch = [2, '', ''] #overlap score (minimum value 3), otherStr index, assembled str portion
    for otherStr in strFrag[1:]:
        for x in range(0,len(otherStr)):
            if otherStr[x:] == strFrag[0][:len(otherStr[x:])]:
                if len(otherStr)-x > bestMatch[0]:
                    bestMatch = [len(otherStr)-x, strFrag.index(otherStr), otherStr[:x]+strFrag[0]]
            if otherStr[:-x] == strFrag[0][-len(otherStr[x:]):]:
                if x > bestMatch[0]:
                    bestMatch = [x, strFrag.index(otherStr), strFrag[0]+otherStr[-x:]]
    if bestMatch[0] > 2:
        strFrag[0] = bestMatch[2]
        strFrag = strFrag[:bestMatch[1]]+strFrag[bestMatch[1]+1:]

print(strFrag)       
print(strFrag[0])

Basically the code compares every string/fragment to the first in list and finds the best match (most overlap). It consolidates the list progressively, merging the best matches and removing the individual strings. Code assumes that there are no unfillable gaps between strings/fragments (Otherwise answer may not result in longest possible assembly. Can be solved by randomizing the starting string/fragment). Also assumes that the reverse complement is not present (poor assumption with contig assembly), which would result in nonsense/unmatchable strings/fragments. I've included a way to restrict the minimum match requirements (changing bestMatch[0] value) to prevent false matches. Last assumption is that all matches are exact. To enable flexibility in permitting mismatches when assembling the sequence makes the problem considerably more complex. I can provide a solution for assembling with mismatches upon request.

like image 43
Ghoti Avatar answered Jan 31 '23 20:01

Ghoti