Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Rough string alignment in python

If I have two strings of equal length like the following:

'aaaaabbbbbccccc'
'bbbebcccccddddd'

Is there an efficient way to align the two such that the most letters as possible line up as shown below?

'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

The only way I can think of doing this is brute force by editing the strings and then iterating through and comparing.

like image 211
The Nightman Avatar asked Mar 06 '17 20:03

The Nightman


1 Answers

Return the index which gives the maximum score, where the maximum score is the strings which have the most matching characters.

def best_overlap(a, b):
    return max([(score(a[offset:], b), offset) for offset in xrange(len(a))], key=lambda x: x[0])[1]

def score(a, b):
    return sum([a[i] == b[i] for i in xrange(len(a))])

>>> best_overlap(a, b)
5
>>> a + '-' * best_overlap(a, b); '-' * best_overlap(a, b) + b
'aaaaabbbbbccccc-----'
'-----bbbebcccccddddd'

Or, equivalently:

def best_match(a, b):
    max = 0
    max_score = 0
    for offset in xrange(len(a)):
        val = score(a[offset:], b)
        if val > max_score:
            max_score = val
            max = offset
    return max

There is room for optimizations such as:

  1. Early exit for no matching characters

  2. Early exit when maximum possible match found

like image 140
TemporalWolf Avatar answered Sep 28 '22 20:09

TemporalWolf