Efficiently find binary strings with low Hamming distance in large set

Problem:

Given a large (~100 million) list of unsigned 32-bit integers, an unsigned 32-bit integer input value, and a maximum Hamming Distance, return all list members that are within the specified Hamming Distance of the input value.

The actual data structure to hold the list is open; performance requirements dictate an in-memory solution. The cost to build the data structure is secondary; low cost to query the data structure is critical.

Example:

For a maximum Hamming Distance of 1 (values typically will be quite small), and input:

    00001000100000000000000001111101

the values:

    01001000100000000000000001111101
    00001000100000000010000001111101

should match because there is only 1 position in which the bits are different. The value:

    11001000100000000010000001111101

should not match because 3 bit positions are different.

My thoughts so far:

For the degenerate case of a Hamming Distance of 0, just use a sorted list and do a binary search for the specific input value.

If the Hamming Distance would only ever be 1, I could flip each bit in the original input and repeat the above 32 times.
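A minimal sketch of that enumerate-and-probe idea for distance <= 1, assuming the list is kept as a sorted array; the names (values, count, emit) are placeholders of mine, not anything from the question:

#include <stdint.h>
#include <stdlib.h>

/* Comparator for bsearch over uint32_t values. */
static int cmp_u32(const void *a, const void *b) {
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Report every member of the sorted array `values` within Hamming
   distance 1 of `input` by probing all 33 candidates. */
static void query_dist1(const uint32_t *values, size_t count,
                        uint32_t input, void (*emit)(uint32_t)) {
    if (bsearch(&input, values, count, sizeof *values, cmp_u32))
        emit(input);                          /* distance 0 */
    for (int bit = 0; bit < 32; bit++) {      /* flip each bit once */
        uint32_t candidate = input ^ (1u << bit);
        if (bsearch(&candidate, values, count, sizeof *values, cmp_u32))
            emit(candidate);
    }
}

For distance 1 this costs 33 binary searches, i.e. O(33 log n), but the candidate count grows combinatorially with the distance.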

How can I efficiently (without scanning the entire list) discover list members with a Hamming Distance > 1?

Asked Jun 17 '11 by Eric J.


2 Answers

Question: What do we know about the Hamming distance d(x,y)?

Answer:

  1. It is non-negative: d(x,y) ≥ 0
  2. It is only zero for identical inputs: d(x,y) = 0 ⇔ x = y
  3. It is symmetric: d(x,y) = d(y,x)
  4. It obeys the triangle inequality, d(x,z) ≤ d(x,y) + d(y,z)

Question: Why do we care?

Answer: Because it means that the Hamming distance is a metric, i.e. that our values form a metric space. There are well-studied algorithms for indexing metric spaces.

  • Metric tree (Wikipedia)
  • BK-tree (Wikipedia)
  • M-tree (Wikipedia)
  • VP-tree (Wikipedia)
  • Cover tree (Wikipedia)

You can also look up algorithms for "spatial indexing" in general, armed with the knowledge that your space is not Euclidean but it is a metric space. Many books on this subject cover string indexing using a metric such as the Hamming distance.
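To make the idea concrete, here is a minimal BK-tree sketch over 32-bit keys with the Hamming distance as the metric. All names are mine, and this is an illustrative sketch, not the benchmark code shown below:

#include <stdint.h>
#include <stdlib.h>

static inline int hamming(uint32_t x, uint32_t y) {
    return __builtin_popcount(x ^ y);
}

typedef struct Node {
    uint32_t key;
    struct Node *child[33];  /* child[d] holds keys at distance d from key */
} Node;

static Node *bk_insert(Node *n, uint32_t key) {
    if (!n) {
        n = calloc(1, sizeof *n);
        n->key = key;
        return n;
    }
    int d = hamming(n->key, key);
    if (d > 0)  /* d == 0 is a duplicate; keep a single copy */
        n->child[d] = bk_insert(n->child[d], key);
    return n;
}

/* The triangle inequality says a match within maxd of `target` can only
   live in children whose edge label d satisfies |d - d(node,target)| <= maxd. */
static void bk_query(const Node *n, uint32_t target, int maxd,
                     void (*emit)(uint32_t)) {
    if (!n) return;
    int d = hamming(n->key, target);
    if (d <= maxd) emit(n->key);
    int lo = d - maxd < 1 ? 1 : d - maxd;
    int hi = d + maxd > 32 ? 32 : d + maxd;
    for (int i = lo; i <= hi; i++)
        bk_query(n->child[i], target, maxd, emit);
}

The pruning of every shell whose distance band cannot contain a match is where the sub-linear coverage numbers in the benchmark below come from.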

Footnote: If you are comparing the Hamming distance of fixed-width strings, you may be able to get a significant performance improvement by using assembly or processor intrinsics. For example, with GCC (see the manual) you can do this:

static inline int distance(unsigned x, unsigned y) {
    return __builtin_popcount(x ^ y);
}

If you then inform GCC that you are compiling for a machine with SSE4a (e.g. via -mpopcnt), I believe that should reduce to just a couple of opcodes.

Edit: According to a number of sources, this is sometimes/often slower than the usual mask/shift/add code. Benchmarking shows that on my system, a C version outperforms GCC's __builtin_popcount by about 160%.
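For reference, the "usual mask/shift/add" version is the classic SWAR popcount; it is portable C with no CPU feature requirements:

#include <stdint.h>

/* SWAR popcount: add bits in pairs, then nibbles, then sum all
   bytes at once with a multiply. */
static inline int popcount32(uint32_t v) {
    v = v - ((v >> 1) & 0x55555555u);
    v = (v & 0x33333333u) + ((v >> 2) & 0x33333333u);
    v = (v + (v >> 4)) & 0x0F0F0F0Fu;
    return (int)((v * 0x01010101u) >> 24);
}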

Addendum: I was curious about the problem myself, so I profiled three implementations: linear search, BK tree, and VP tree. Note that VP and BK trees are very similar. The children of a node in a BK tree are "shells" of trees containing points that are each a fixed distance from the tree's center. A node in a VP tree has two children, one containing all the points within a sphere centered on the node's center and the other child containing all the points outside. So you can think of a VP node as a BK node with two very thick "shells" instead of many finer ones.
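Continuing the sketch from above (and reusing its hamming helper), a VP-tree lookup prunes with the same triangle inequality; again, the names are mine, not the profiled code:

typedef struct VPNode {
    uint32_t vantage;       /* this node's center point */
    int radius;             /* inside: dist <= radius; outside: dist > radius */
    struct VPNode *inside;
    struct VPNode *outside;
} VPNode;

static void vp_query(const VPNode *n, uint32_t target, int maxd,
                     void (*emit)(uint32_t)) {
    if (!n) return;
    int d = hamming(n->vantage, target);
    if (d <= maxd) emit(n->vantage);
    /* A match can be inside only if d - maxd <= radius, and outside
       only if d + maxd > radius; often both subtrees must be visited. */
    if (d - maxd <= n->radius) vp_query(n->inside, target, maxd, emit);
    if (d + maxd >  n->radius) vp_query(n->outside, target, maxd, emit);
}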

The results were captured on my 3.2 GHz PC, and the algorithms do not attempt to utilize multiple cores (which should be easy). I chose a database size of 100M pseudorandom integers. Results are the average of 1000 queries for distance 1..5, and 100 queries for 6..10 and the linear search.

  • Database: 100M pseudorandom integers
  • Number of tests: 1000 for distance 1..5, 100 for distance 6..10 and linear
  • Results: Average # of query hits (very approximate)
  • Speed: Number of queries per second
  • Coverage: Average percentage of database examined per query
                 -- BK Tree --    -- VP Tree --    -- Linear --
Dist   Results   Speed    Cov     Speed    Cov     Speed   Cov
  1       0.90    3800   0.048%    4200   0.048%
  2      11        300   0.68%      330   0.65%
  3     130         56   3.8%        63   3.4%
  4     970         18   12%         22   10%
  5    5700          8.5 26%         10   22%
  6    2.6e4         5.2 42%          6.0 37%
  7    1.1e5         3.7 60%          4.1 54%
  8    3.5e5         3.0 74%          3.2 70%
  9    1.0e6         2.6 85%          2.7 82%
 10    2.5e6         2.3 91%          2.4 90%
any                                               2.2    100%

In your comment, you mentioned:

I think BK-trees could be improved by generating a bunch of BK-trees with different root nodes, and spreading them out.

I think this is exactly the reason why the VP tree performs (slightly) better than the BK tree. Being "deeper" rather than "shallower", it compares against more points rather than using finer-grained comparisons against fewer points. I suspect that the differences are more extreme in higher dimensional spaces.

A final tip: leaf nodes in the tree should just be flat arrays of integers for a linear scan. For small sets (maybe 1000 points or fewer) this will be faster and more memory efficient.
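Such a leaf amounts to nothing more than a loop like this (names hypothetical), which compilers vectorize well:

#include <stddef.h>
#include <stdint.h>

/* Leaf node: scan a flat array with popcount, no pointer chasing. */
static void leaf_query(const uint32_t *keys, size_t n, uint32_t target,
                       int maxd, void (*emit)(uint32_t)) {
    for (size_t i = 0; i < n; i++)
        if (__builtin_popcount(keys[i] ^ target) <= maxd)
            emit(keys[i]);
}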

Answered by Dietrich Epp

I wrote a solution where I represent the input numbers in a bitset of 2^32 bits, so I can check in O(1) whether a certain number is in the input. Then for a queried number and maximum distance, I recursively generate all numbers within that distance and check them against the bitset.

For example, for maximum distance 5, this is 242,825 numbers (Σ_{d=0}^{5} C(32, d)). For comparison, Dietrich Epp's VP-tree solution goes through 22% of the 100 million numbers, i.e., through 22 million numbers.
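A minimal sketch of the idea, assuming a flat 2^32-bit bitset (512 MiB); the names and recursion shape here are mine, not necessarily the actual code linked below:

#include <stdint.h>
#include <stdlib.h>

static uint8_t *bitset;  /* 2^32 bits = 512 MiB, one bit per 32-bit value */

static void bitset_build(const uint32_t *values, size_t count) {
    bitset = calloc((size_t)1 << 29, 1);  /* 2^29 bytes = 2^32 bits */
    for (size_t i = 0; i < count; i++)
        bitset[values[i] >> 3] |= 1u << (values[i] & 7);
}

/* Flip up to `budget` more bits, only at positions >= frombit so each
   candidate is generated exactly once, and report every number present. */
static void enumerate(uint32_t v, int frombit, int budget,
                      void (*emit)(uint32_t)) {
    if (bitset[v >> 3] & (1u << (v & 7))) emit(v);
    if (budget == 0) return;
    for (int b = frombit; b < 32; b++)
        enumerate(v ^ (1u << b), b + 1, budget - 1, emit);
}

/* Usage: bitset_build(list, n); enumerate(input, 0, maxdist, emit); */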

I used Dietrich's code/solutions as the basis to add my solution and compare it with his. Here are speeds, in queries per second, for maximum distances up to 10:

Dist      BK Tree     VP Tree         Bitset   Linear
   1    10,133.83   15,773.69   1,905,202.76     4.73
   2       677.78    1,006.95     218,624.08     4.70
   3       113.14      173.15      27,022.32     4.76
   4        34.06       54.13       4,239.28     4.75
   5        15.21       23.81         932.18     4.79
   6         8.96       13.23         236.09     4.78
   7         6.52        8.37          69.18     4.77
   8         5.11        6.15          23.76     4.68
   9         4.39        4.83           9.01     4.47
  10         3.69        3.94           2.82     4.13

Prepare      4.1s       21.0s          1.52s    0.13s
(preparation times: building the data structure before the queries)

For small distances, the bitset solution is by far the fastest of the four. Question author Eric commented below that the largest distance of interest would probably be 4-5. Naturally, my bitset solution becomes slower for larger distances, even slower than the linear search (for distance 32, it would go through 2^32 numbers). But for distance 9 it still easily leads.

I also modified Dietrich's testing. Each of the above results is for letting the algorithm solve at least three queries, and as many queries as it can, in about 15 seconds (I do rounds with 1, 2, 4, 8, 16, etc. queries, until at least 10 seconds have passed in total). That's fairly stable; I even get similar numbers for just 1 second.

My CPU is an i7-6700. My code (based on Dietrich's) is here. Ignore the documentation there, at least for now; I'm not sure what to do about that, but tree.c contains all the code, and my test.bat shows how I compiled and ran it (I used the flags from Dietrich's Makefile). Shortcut to my solution.

One caveat: my query results contain each number only once, so if the input list contains duplicate numbers, that may or may not be desired. In question author Eric's case, there were no duplicates (see comment below). In any case, this solution may suit people who either have no duplicates in the input or don't want or need duplicates in the query results (likely the raw query results are only a means to an end, and some other code then turns the numbers into something else, for example a map from a number to a list of files whose hash is that number).

Answered by Stefan Pochmann