Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hamming Distance optimization for MySQL or PostgreSQL?

I trying to improve search similar images pHashed in MySQL database. Right now I comparing pHash counting hamming distance like this:

SELECT * FROM images WHERE BIT_COUNT(hash ^ 2028359052535108275) <= 4

Results for selecting (engine MyISAM)

  • 20000 rows ; query time < 20ms
  • 100000 rows ; query time ~ 60ms # this was just fine, until its reached 150000 rows
  • 300000 rows ; query time ~ 150ms

So query time encrease depends of the number of rows in table.


I also try solutions found on stackoverflow Hamming distance on binary strings in SQL

SELECT * FROM images WHERE 
BIT_COUNT(h1 ^ 11110011) + 
BIT_COUNT(h2 ^ 10110100) + 
BIT_COUNT(h3 ^ 11001001) + 
BIT_COUNT(h4 ^ 11010001) + 
BIT_COUNT(h5 ^ 00100011) + 
BIT_COUNT(h6 ^ 00010100) + 
BIT_COUNT(h7 ^ 00011111) + 
BIT_COUNT(h8 ^ 00001111) <= 4

rows 300000 ; query time ~ 240ms


I changed database engine to PostgreSQL. Translate this MySQL query to PyGreSQL Without success. rows 300000 ; query time ~ 18s


Is there any solution to optimize above queries? I mean optimization not depended of the number of rows.

I have limited ways (tools) to solve this problem. MySQL so far seemed to be the simplest solution but I can deploy code on every open source database engine that will work with Ruby on dedicated machine. There is some ready solutions for MsSQL https://stackoverflow.com/a/5930944/766217 (not tested). Maybe someone know how to translate it for MySQL or PostgreSQL.

Please, post answers based on some code or observations. We have a lot of theoretical issues about hamming distance on stackoverflow.com

Thanks!

like image 553
mateuszdw Avatar asked Feb 17 '13 19:02

mateuszdw


2 Answers

This solution made things a bit faster for me. It makes a derived table for each hash compare, and returns only the results that are less than the ham distance. This way, it's not doing the BIT_COUNT on a pHash that has already exceeded the ham. It returns all matches in about 2.25 seconds on 2.6 million records.

It's InnoDB, and I have very few indexes.

If somebody can make it faster, I'll appreciate you.

SELECT *, BIT_COUNT(pHash3 ^ 42597524) + BC2 AS BC3 
FROM ( 
    SELECT *, BIT_COUNT(pHash2 ^ 258741369) + BC1 AS BC2 
    FROM ( 
        SELECT *, BIT_COUNT(pHash1 ^ 5678910) + BC0 AS BC1 
        FROM ( 
            SELECT `Key`, pHash0, pHash1, pHash2, pHash3, BIT_COUNT(pHash0 ^ 1234567) as BC0 
            FROM files 
            WHERE  BIT_COUNT(pHash0 ^ 1234567) <= 3 
        ) AS BCQ0 
        WHERE BIT_COUNT(pHash1 ^ 5678910) + BC0 <= 3 
    ) AS BCQ1 
    WHERE BIT_COUNT(pHash2 ^ 258741369) + BC1 <= 3 
    ) AS BCQ2 
WHERE BIT_COUNT(pHash3 ^ 42597524) + BC2 <= 3

This is the equivalent query, but without the derived tables. Its return time is almost 3 times as long.

SELECT `Key`, pHash0, pHash1, pHash2, pHash3 
FROM Files 
WHERE BIT_COUNT(pHash0 ^ 1234567) + BIT_COUNT(pHash1 ^ 5678910) + BIT_COUNT(pHash2 ^ 258741369) + BIT_COUNT(pHash3 ^ 42597524) <=3

Keeping in mind that the lower the ham value on the first one, the faster it will run.

like image 29
alfadog67 Avatar answered Oct 12 '22 09:10

alfadog67


When considering the efficiency of algorithms, computer scientists use the concept of the order denoted O(something) where something is a function of n, the number of things being computed, in this case rows. So we get, in increasing time:

  • O(1) - independent of the number of items
  • O(log(n)) - increases as the logarithm of the items
  • O(n) - increases in proportion of the items (what you have)
  • O(n^2) - increases as the square of the items
  • O(n^3) - etc
  • O(2^n) - increases exponentially
  • O(n!) - increases with the factorial of the number

The last 2 are effectively uncomputable for any reasonable number of n (80+).

Only the most significant term matters since this dominates for large n so n^2 and 65*n^2+787*n+4656566 are both O(n^2)

Bearing in mind that this is a mathematical construction and the time an algorithm takes with real software on real hardware using real data may be heavily influenced by other things (e.g. an O(n^2) memory operation may take less time than an O(n) disk operation).

For your problem, you need to run through each row and compute BIT_COUNT(hash ^ 2028359052535108275) <= 4. This is an O(n) operation.

The only way this could be improved is by utilizing an index since a b-tree index retrieval is an O(log(n)) operation.

However, because your column field is contained within a function, an index on that column cannot be used. You have 2 possibilities:

  1. This is an SQL server solution and I don't know if it is portable to MySQL. Create a persisted calculated column in your table with the formula BIT_COUNT(hash ^ 2028359052535108275) and put an index on it. This will not be suitable if you need to change the bit mask.
  2. Work out a way of doing the bitwise arithmetic without using the BIT_COUNT function.
like image 146
Dale M Avatar answered Oct 12 '22 09:10

Dale M