Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Inverse of Hamming Distance

Tags:

*This is a brief introduction, the specific question is in bold at the last paragraph.

I'm trying to generate all strings with a given Hamming Distance to solve efficiently a bioinformatic assignment.

The idea is, given a string (ie. 'ACGTTGCATGTCGCATGATGCATGAGAGCT'), the length of the word to search (ie. 4) and the acceptable mismatches when searching that word in the string (ie. 1), return the most frequent words or 'mutated' words.

To be clear, a word of length 4 from the given string can be this (between '[ ]'):

[ACGT]TGCATGTCGCATGATGCATGAGAGCT #ACGT 

this

A[CGTT]GCATGTCGCATGATGCATGAGAGCT #CGTT 

or this

ACGTTGCATGTCGCATGATGCATGAG[AGCT] #AGCT 

What I did was (and its very inefficiently, and its really slow when the words need to have 10 characters) generate all possible words with the given distance:

itertools.imap(''.join, itertools.product('ATCG', repeat=wordSize)) 

and then search and compare every word in the given string if the generated words (or its mutation) appears in a loop:

wordFromString = givenString[i:i+wordSize] mismatches = sum(ch1 != ch2 for ch1, ch2 in zip(wordFromString, generatedWord)) if mismatches <= d:     #count that generated word in a list for future use     #(only need the most repeated) 

What I want to do is, instead of generating ALL possible words, generate just the mutations of the words that appear in the given string with a given number of mismatches, in other words, given a Hamming Distance and a word, return all the possible mutated words with that (or less) distance, and then use them for searching in the given string.

I hope I was clear. Thank you.

like image 527
JackS Avatar asked Nov 12 '13 22:11

JackS


People also ask

What is Hamming distance formula?

To calculate the Hamming distance, you simply count the number of bits where two same-length messages differ. An example of Hamming distance 1 is the distance between 1101 and 1001 . If you increase the distance to 2 , we can give as an example 1001 and 1010 .

Is Hamming a XOR distance?

In order to calculate the Hamming distance between two strings, and , we perform their XOR operation, (a⊕ b), and then count the total number of 1s in the resultant string.

How does Hamming distance become Manhattan distance?

by treating each symbol in the string as a real coordinate; with this embedding, the strings form the vertices of an n-dimensional hypercube, and the Hamming distance of the strings is equivalent to the Manhattan distance between the vertices.

What is normalized Hamming distance?

Normalized Hamming distance is the ratio of the Hamming distance to the length of the string. A lower value of Normalized Hamming distance suggests the two strings are more similar. The Levenshtein distance can be computed to find the distance between the two strings and also find the number of edits.


1 Answers

def mutations(word, hamming_distance, charset='ATCG'):     for indices in itertools.combinations(range(len(word)), hamming_distance):         for replacements in itertools.product(charset, repeat=hamming_distance):             mutation = list(word)             for index, replacement in zip(indices, replacements):                 mutation[index] = replacement             yield "".join(mutation) 

This function generates all mutations of a word with a hamming distance less than or equal to a given number. It is relatively efficient, and does not check invalid words. However, valid mutations can appear more than once. Use a set if you want every element to be unique.

like image 89
Joshua Avatar answered Oct 18 '22 12:10

Joshua