I trying to improve search similar images pHashed in MySQL database. Right now I comparing pHash counting hamming distance like this:
SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4
Results for selecting (engine MyISAM)
So query time encrease depends of the number of rows in table.
I also try solutions found on stackoverflow Hamming distance on binary strings in SQL
SELECT * FROM images WHERE
BIT_COUNT(h1 ^ 11110011) +
BIT_COUNT(h2 ^ 10110100) +
BIT_COUNT(h3 ^ 11001001) +
BIT_COUNT(h4 ^ 11010001) +
BIT_COUNT(h5 ^ 00100011) +
BIT_COUNT(h6 ^ 00010100) +
BIT_COUNT(h7 ^ 00011111) +
BIT_COUNT(h8 ^ 00001111) <= 4
rows 300000 ; query time ~ 240ms
I changed database engine to PostgreSQL. Translate this MySQL query to PyGreSQL Without success. rows 300000 ; query time ~ 18s
Is there any solution to optimize above queries? I mean optimization not depended of the number of rows.
I have limited ways (tools) to solve this problem. MySQL so far seemed to be the simplest solution but I can deploy code on every open source database engine that will work with Ruby on dedicated machine. There is some ready solutions for MsSQL https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone know how to translate it for MySQL or PostgreSQL.
Please, post answers based on some code or observations. We have a lot of theoretical issues about hamming distance on stackoverflow.com
Thanks!
This solution made things a bit faster for me. It makes a derived table for each hash compare, and returns only the results that are less than the ham distance. This way, it's not doing the BIT_COUNT on a pHash that has already exceeded the ham. It returns all matches in about 2.25 seconds on 2.6 million records.
It's InnoDB, and I have very few indexes.
If somebody can make it faster, I'll appreciate you.
SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3
FROM (
SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2
FROM (
SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1
FROM (
SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) as BC0
FROM files
WHERE BIT_COUNT(pHash0 ^ 1234567) <= 3
) AS BCQ0
WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3
) AS BCQ1
WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3
) AS BCQ2
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3
This is the equivalent query, but without the derived tables. Its return time is almost 3 times as long.
SELECT `Key`, pHash0, pHash1, pHash2, pHash3
FROM Files
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <=3
Keeping in mind that the lower the ham value on the first one, the faster it will run.
When considering the efficiency of algorithms, computer scientists use the concept of the order denoted O(something) where something is a function of n, the number of things being computed, in this case rows. So we get, in increasing time:
The last 2 are effectively uncomputable for any reasonable number of n (80+).
Only the most significant term matters since this dominates for large n so n^2 and 65*n^2+787*n+4656566 are both O(n^2)
Bearing in mind that this is a mathematical construction and the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).
For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4
. This is an O(n) operation.
The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.
However, because your column field is contained within a function, an index on that column cannot be used. You have 2 possibilities:
BIT_COUNT(hash ^ 2028359052535108275)
and put an index on it. This will not be suitable if you need to change the bit mask.If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With