Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast Hamming distance scoring

There is a database with N fixed length strings. There is a query string of the same length. The problem is to fetch first k strings from the database that have the smallest Hamming distance to q.

N is small (about 400), strings are long, fixed in length. Database doesn't change, so we can pre-compute indexes. Queries vary strongly, caching and/or pre-computation is not an option. There are lots of them per second. We need always k results, even if k-1 results have match 0 (sorting on Hamming distance and take first k, so locality sensitive hashing and similar approaches won't do). kd-tree and similar space partitioning will probably perform worser than linear search (strings can be very long). BK-tree is currently best choice, but it is still slow and complicated than it needs to be.

It feels like there is an algorithm, which will build an index, which will discard most of the entries in very few steps, leaving k <= t << N entries to compute real Hamming distance.

People suggesting fuzzy string matching based on Levenstein distance - thanks, but the problem is much simpler. Generalized distance metric based approaches (like BK-trees) are good, but maybe there something utilizing the facts described above (small DB/long fixed size strings, simple Hamming distance)

Links, keywords, papers, ideas? =)

like image 815
Sardar Avatar asked Jun 22 '10 23:06

Sardar


People also ask

How is Hamming distance calculated?

Calculation of Hamming Distance In order to calculate the Hamming distance between two strings, and , we perform their XOR operation, (a⊕ b), and then count the total number of 1s in the resultant string.

What is the Hamming distance between 10101 and 11110?

The Hamming distance d(10101, 11110) is 3 because 10101 ⊕ 11110 is 01011 (three 1s).

What is maximum Hamming distance?

Hamming distance between two arrays or strings of equal length is the number of positions at which the corresponding characters (elements) are different. Explanation: Possible rotations of given array = 4 1 1 and 1 1 4. In each case the hamming distance is 2. Therefore the maximum hamming distance will be 2.


1 Answers

This seems like a task where a Vantage Point (VP tree) might work... since the hamming distance should satisfy the triangle inequality theorem, you should be able to apply it... its also good for identifying k-closest. I've seen it in image indexing database setups... you might check section 5 of this paper as an example of what I'm talking about (albeit in a different field).

like image 82
tbischel Avatar answered Oct 12 '22 01:10

tbischel