Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Finding Minimum hamming distance of a set of strings in python

I have a set of n (~1000000) strings (DNA sequences) stored in a list trans. I have to find the minimum hamming distance of all sequences in the list. I implemented a naive brute force algorithm, which has been running for more than a day and has not yet given a solution. My code is

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist < dmin:
                    dmin = dist

Is there a more efficient method to do this? Here hamdist is a function I wrote to find hamming distances. It is

def hamdist(str1, str2):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
          diffs += 1
    return diffs
like image 361
Devil Avatar asked Jul 08 '14 05:07

Devil


People also ask

How do you find the minimum Hamming distance?

In order to calculate the Hamming distance between two strings, and , we perform their XOR operation, (a⊕ b), and then count the total number of 1s in the resultant string.

How do you calculate Hamming distance between strings?

Example: Lets say i have 2 strings of equal length "ABCD" and "ACCC". In this example first and second last are same in both string so you need to substitute 2 character in 2nd to make it same hence, hamming distance is 2 in this case. so total hamming distance is 1 + 1 = 2.

What is the Hamming distance of bit strings 10101 and 11110?

The Hamming distance d(10101, 11110) is 3 because 10101 ⊕ 11110 is 01011 (three 1s).


1 Answers

You could optimize your hamdist function by adding an optional parameter containing the minimum distance you have got so far, this way if diffs reaches that value you stop calculating the distance because this comparison will give you a greater distance than the minimum:

def hamdist(str1, str2,prevMin=None):
    diffs = 0
    if len(str1) != len(str2):
        return max(len(str1),len(str2))
    for ch1, ch2 in zip(str1, str2):
        if ch1 != ch2:
            diffs += 1
            if prevMin is not None and diffs>prevMin:
                return None
    return diffs 

You will need to adapt your main loop to work with None return value from hamdist:

dmin=len(trans[0])
for i in xrange(len(trans)):
    for j in xrange(i+1,len(trans)):
            dist=hamdist(trans[i][:-1], trans[j][:-1])
            if dist is not None and dist < dmin:
                    dmin = dist
like image 186
Pablo Francisco Pérez Hidalgo Avatar answered Oct 22 '22 12:10

Pablo Francisco Pérez Hidalgo