Given a long string L and a shorter string S (the constraint is that L.length must be >= S.length), I want to find the minimum Hamming distance between S and any substring of L with length equal to S.length. Let's call the function for this minHamming(). For example:

minHamming(ABCDEFGHIJ, CDEFGG) == 1
minHamming(ABCDEFGHIJ, BCDGHI) == 3
Doing this the obvious way (enumerating every substring of L) requires O(S.length * L.length) time. Is there any clever way to do this in sublinear time? I search the same L with several different S strings, so doing some complicated preprocessing of L once is acceptable.
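For reference, the obvious quadratic approach is only a few lines in Python (the snake_cased name min_hamming_naive is mine, matching the minHamming() above):

```python
def min_hamming_naive(L, S):
    """Minimum Hamming distance between S and any length-len(S) substring of L.

    Slides S along L and counts mismatches at each alignment,
    so it runs in O(len(S) * len(L)) time.
    """
    m = len(S)
    return min(
        sum(a != b for a, b in zip(L[i:i + m], S))
        for i in range(len(L) - m + 1)
    )
```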
Edit: The modified Boyer-Moore would be a good idea, except that my alphabet is only 4 letters (DNA).
Perhaps surprisingly, this exact problem can be solved in just O(|A| n log n) time using Fast Fourier Transforms (FFTs), where n is the length of the larger sequence L and |A| is the size of the alphabet.
Here is a freely available PDF of a paper by Donald Benson describing how it works:
Summary: Convert each of your strings S and L into several indicator vectors (one per character, so 4 in the case of DNA), and then convolve corresponding vectors to determine match counts for each possible alignment. The trick is that convolution in the "time" domain, which ordinarily requires O(n^2) time, can be implemented using multiplication in the "frequency" domain, which requires just O(n) time, plus the time required to convert between domains and back again. Using the FFT each conversion takes just O(n log n) time, so the overall time complexity is O(|A| n log n). For greatest speed, finite field FFTs are used, which require only integer arithmetic.
Note: For arbitrary S and L this algorithm is clearly a huge performance win over the straightforward O(mn) algorithm as |S| and |L| become large, but OTOH if S is typically shorter than log|L| (e.g. when querying a large DB with a small sequence), then obviously this approach provides no speedup.
UPDATE 21/7/2009: Updated to mention that the time complexity also depends linearly on the size of the alphabet, since a separate pair of indicator vectors must be used for each character in the alphabet.