Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hamming distance on binary strings in SQL

Tags:

I have a table in my DB where I store SHA256 hashes in a BINARY(32) column. I'm looking for a way to compute the Hamming distance of the entries in the column to a supplied value, i.e. something like:

SELECT * FROM table    ORDER BY HAMMINGDISTANCE(hash, UNHEX(<insert supplied sha256 hash here>)) ASC    LIMIT 10 

(in case you're wondering, the Hamming distance of strings A and B is defined as BIT_COUNT(A^B), where ^ is the bitwise XOR operator and BIT_COUNT returns the number of 1s in the binary string).

Now, I know that both the ^ operator and BIT_COUNT function only work on INTEGERs and so I'd say that probably the only way to do it would be to break up the binary strings in substrings, cast each binary substring to integer, compute the Hamming distance substring-wise and then add them. The problem with this is that it sounds terribly complicated, not efficient and definitely not elegant. My question therefore is: could you suggest any better way? (please note that I'm on shared hosting and therefore I can't modify the DB server or load libraries)

edit(1): Obviously loading the whole table in PHP and doing the computations there would be possible but I'd rather avoid it because this table will probably grow quite large.

edit(2): The DB server is MySQL 5.1

edit(3): My answer below contains the code that I just described above.

edit(4): I just found out that using 4 BIGINTs to store the hash instead of a BINARY(32) yields massive speed improvements (more than 100 times faster). See the comments to my answer below.

like image 837
CAFxX Avatar asked Jan 23 '11 22:01

CAFxX


People also ask

How do you calculate Hamming distance in a string?

To calculate the Hamming distance, you simply count the number of bits where two same-length messages differ. An example of Hamming distance 1 is the distance between 1101 and 1001 . If you increase the distance to 2 , we can give as an example 1001 and 1010 .

What is the Hamming distance between two strings?

In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.

What is the Hamming distance between 1010 and 0110 *?

010 ⊕ 011 = 001, d(010, 011) = 1. 010 ⊕ 101 = 111, d(010, 101) = 3.

What is the Hamming distance between the data?

A Hamming distance in information technology represents the number of points at which two corresponding pieces of data can be different. It is often used in various kinds of error correction or evaluation of contrasting strings or pieces of data.


1 Answers

It appears that storing the data in a BINARY column is an approach bound to perform poorly. The only fast way to get decent performance is to split the content of the BINARY column in multiple BIGINT columns, each containing an 8-byte substring of the original data.

In my case (32 bytes) this would mean using 4 BIGINT columns and using this function:

CREATE FUNCTION HAMMINGDISTANCE(   A0 BIGINT, A1 BIGINT, A2 BIGINT, A3 BIGINT,    B0 BIGINT, B1 BIGINT, B2 BIGINT, B3 BIGINT ) RETURNS INT DETERMINISTIC RETURN    BIT_COUNT(A0 ^ B0) +   BIT_COUNT(A1 ^ B1) +   BIT_COUNT(A2 ^ B2) +   BIT_COUNT(A3 ^ B3); 

Using this approach, in my testing, is over 100 times faster than using the BINARY approach.


FWIW, this is the code I was hinting at while explaining the problem. Better ways to accomplish the same thing are welcome (I especially don't like the binary > hex > decimal conversions):

CREATE FUNCTION HAMMINGDISTANCE(A BINARY(32), B BINARY(32)) RETURNS INT DETERMINISTIC RETURN    BIT_COUNT(     CONV(HEX(SUBSTRING(A, 1,  8)), 16, 10) ^      CONV(HEX(SUBSTRING(B, 1,  8)), 16, 10)   ) +   BIT_COUNT(     CONV(HEX(SUBSTRING(A, 9,  8)), 16, 10) ^      CONV(HEX(SUBSTRING(B, 9,  8)), 16, 10)   ) +   BIT_COUNT(     CONV(HEX(SUBSTRING(A, 17, 8)), 16, 10) ^      CONV(HEX(SUBSTRING(B, 17, 8)), 16, 10)   ) +   BIT_COUNT(     CONV(HEX(SUBSTRING(A, 25, 8)), 16, 10) ^      CONV(HEX(SUBSTRING(B, 25, 8)), 16, 10)   ); 
like image 132
CAFxX Avatar answered Sep 16 '22 17:09

CAFxX