If you have n binary strings, each of length m, is there a faster way to determine the minimum Hamming distance between any pair than to compare all O(n^2) pairs and compute the Hamming distance of each? That is, can it be done in less than O(n^2 m) time?
Apart from anything else, and as noted in the comments, the Hamming distance is a proper metric and so satisfies the triangle inequality, which makes me feel there should be a faster solution.
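For reference, here is a minimal sketch of the baseline the question describes, assuming the inputs are Python strings of '0'/'1' characters (any equal-length sequences would work the same way):

```python
from itertools import combinations

def hamming(a, b):
    """Number of positions where a and b differ: O(m) per pair."""
    return sum(x != y for x, y in zip(a, b))

def min_hamming_bruteforce(strings):
    """Compare all O(n^2) pairs, O(n^2 m) total work."""
    return min(hamming(a, b) for a, b in combinations(strings, 2))
```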
Consider using Locality-Sensitive Hashing (LSH), a general technique that can be applied to certain distance metrics, including Hamming distance. Excerpt from Wikipedia:
LSH hashes input items so that similar items map to the same “buckets” with high probability (the number of buckets being much smaller than the universe of possible input items).
In short, you can use LSH to obtain buckets, brute-force Hamming distances within each bucket, and output the smallest distance found. To obtain the right answer with higher probability, you can tweak the parameters of the LSH algorithm and/or run LSH multiple times (to get different allocations of items to buckets). I believe you can get arbitrarily close to the correct (optimal) answer with a failure rate exponentially decreasing in runtime. (You might have to binary search over the LSH parameters if your Hamming distances are all very close, but you'll still avoid computing n^2 Hamming distances.)
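Here is a rough sketch of that bucketing idea using bit-sampling LSH for Hamming distance; it is not the exact algorithm from the lecture notes, and the parameters k (bits sampled per hash) and trials (number of independent repetitions) are illustrative and would need tuning as described above:

```python
import random
from collections import defaultdict
from itertools import combinations

def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))

def min_hamming_lsh(strings, k, trials, seed=0):
    """Approximate minimum Hamming distance via bit-sampling LSH.

    May return float('inf') if no pair ever lands in the same bucket;
    increase `trials` or decrease `k` in that case.
    """
    rng = random.Random(seed)
    m = len(strings[0])
    best = float('inf')
    for _ in range(trials):
        coords = rng.sample(range(m), k)            # k random bit positions (k <= m)
        buckets = defaultdict(list)
        for s in strings:
            key = ''.join(s[i] for i in coords)     # projected "signature" of s
            buckets[key].append(s)
        for group in buckets.values():              # brute force inside each bucket
            for a, b in combinations(group, 2):
                best = min(best, hamming(a, b))
    return best
```

The intuition for the repetition: if the closest pair differs in d positions, it agrees on a single sampled bit with probability 1 - d/m, so it collides in one trial with probability roughly (1 - d/m)^k; running many independent trials drives the chance of never catching that pair down exponentially.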
The algorithm and analysis are pretty involved, so I don't think I can write a complete summary here at the moment (it's about 2-3 hours worth of lecture material). I recommend taking a look at the lecture notes/slides here, here, and here; they all cover LSH (in varying degrees of detail) with some mention of Hamming distance.