Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example:

minHamming(ABCDEFGHIJ, CDEFGG) == 1
minHamming(ABCDEFGHIJ, BCDGHI) == 3
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing of L once is acceptable.
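For reference, the obvious quadratic approach is only a few lines in Python (the snake_cased name min_hamming_naive is mine, matching the minHamming() above):

```python
def min_hamming_naive(L, S):
    """Minimum Hamming distance between S and any length-len(S) substring of L.

    Slides S along L and counts mismatches at each alignment,
    so it runs in O(len(S) * len(L)) time.
    """
    m = len(S)
    return min(
        sum(a != b for a, b in zip(L[i:i + m], S))
        for i in range(len(L) - m + 1)
    )
```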
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.