I have a table A with a column 'template_phash', in which I store the phashes generated from 400K images.
I then take a random image and generate a phash from it.
How do I query table A for the records whose Hamming distance from that phash is less than a threshold value, say 20?
I have seen "Hamming distance on binary strings in SQL", but couldn't figure it out.
I think I need to write a function to achieve this, but how?
Both of my phashes are BigInt values, e.g. 7641692061273169067.
Please help me write the function so that I can query like this:
SELECT product_id, HAMMING_DISTANCE(phash1, phash2) as hd
FROM A
WHERE hd < 20 ORDER BY hd ASC;
To calculate the Hamming distance, you count the number of bit positions in which two same-length messages differ. For example, 1101 and 1001 have Hamming distance 1, and 1001 and 1010 have Hamming distance 2.
In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding symbols are different.
Definition 1 (Hamming distance): Given two vectors u, v ∈ F^n, we define the Hamming distance between u and v, d(u, v), to be the number of places where u and v differ. Thus the Hamming distance between two binary vectors is the number of bits we must change to turn one into the other.
Hamming distance is a metric for comparing two binary data strings. For two binary strings of equal length, it is the number of bit positions in which the two strings differ. The Hamming distance between two strings a and b is denoted d(a, b).
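The definitions above can be checked with a few lines of code. The following is a minimal Python sketch, not part of the original question; the function name `hamming_distance_str` is made up for illustration, and it simply counts the positions at which two equal-length strings differ.

```python
def hamming_distance_str(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        # The Hamming distance is only defined for equal-length strings.
        raise ValueError("strings must have the same length")
    return sum(x != y for x, y in zip(a, b))


print(hamming_distance_str("1101", "1001"))  # 1
print(hamming_distance_str("1001", "1010"))  # 2
```

The two calls reproduce the distance-1 and distance-2 examples given in the text.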
I figured out that the Hamming distance is just the number of bits that differ between the two hashes. First XOR the two hashes, then count the 1 bits:
SELECT product_id, BIT_COUNT(phash1 ^ phash2) as hd from A ORDER BY hd ASC;
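To see what the XOR-then-count approach computes, here is a rough Python sketch of the same idea applied to 64-bit integer hashes, with the threshold filter from the question added. The names `hamming_distance` and `matches_within`, the sample rows, and the 64-bit mask are assumptions made for this illustration; they are not taken from the original post.

```python
MASK_64 = (1 << 64) - 1  # assumption: hashes are 64-bit values, as a BigInt column would hold


def hamming_distance(a: int, b: int) -> int:
    """XOR the two hashes, then count the 1 bits of the result."""
    # Masking to 64 bits makes negative (signed) values count as their
    # two's-complement bit patterns.
    return bin((a ^ b) & MASK_64).count("1")


def matches_within(rows, query_hash, threshold=20):
    """Return (product_id, hd) pairs with hd < threshold, closest first."""
    scored = [(pid, hamming_distance(h, query_hash)) for pid, h in rows]
    return sorted((item for item in scored if item[1] < threshold),
                  key=lambda item: item[1])


# Hypothetical rows of (product_id, template_phash), for illustration only.
rows = [(1, 0b1101), (2, 0b1001), (3, MASK_64)]
print(matches_within(rows, 0b1001, threshold=2))  # [(2, 0), (1, 1)]
```

In MySQL itself, a SELECT alias such as hd cannot be referenced in the WHERE clause, so a threshold filter would typically be written as HAVING hd < 20 or by repeating the BIT_COUNT expression in WHERE.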