 

Hamming Distance / Similarity searches in a database

I have a process, similar to TinEye's, that generates perceptual hashes; these are 32-bit ints.

I intend to store these in an SQL database (or maybe a NoSQL DB) in the future.

However, I'm stumped as to how I would retrieve records based on the similarity of the hashes.

Any Ideas?

asked Mar 07 '12 by oPless



1 Answer

A common approach (at least common to me) is to divide your hash bit string into several chunks and query on these chunks for an exact match. This is a "pre-filter" step. You can then perform a bitwise Hamming distance computation on the returned results, which should be only a small subset of your overall dataset. This can be done using data files or SQL tables.

So in simple terms: say you have a bunch of 32-bit hashes in a DB and you want to find every hash that is within a 4-bit Hamming distance of your "query" hash:

  1. create a table with four columns: each will contain an 8-bit slice (as a string or int) of the 32-bit hashes, islice 1 to 4.

  2. slice your query hash the same way into qslice 1 to 4.

  3. query this table such that any of qslice1=islice1 or qslice2=islice2 or qslice3=islice3 or qslice4=islice4. This gives you every DB hash that is within 3 bits (4 - 1) of the query hash: if at most 3 bits differ, at least one of the four 8-bit slices must match exactly.

  4. for each returned hash, compute the exact Hamming distance pair-wise with your query hash (reconstructing the index-side hash from the four slices)

The number of operations in step 4 should be much less than a full pair-wise hamming computation of your whole table.
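To make the steps above concrete, here is a minimal sketch in Python using SQLite. The table and column names (hashes, islice1 through islice4) and the sample values are illustrative assumptions, not a prescribed schema:

    import sqlite3

    def slices(h):
        """Split a 32-bit hash into four 8-bit chunks (most significant first)."""
        return [(h >> shift) & 0xFF for shift in (24, 16, 8, 0)]

    def hamming(a, b):
        """Exact Hamming distance between two 32-bit ints."""
        return bin(a ^ b).count("1")  # or int.bit_count() on Python 3.10+

    db = sqlite3.connect(":memory:")
    db.execute("""CREATE TABLE hashes (
                      hash INTEGER,
                      islice1 INTEGER, islice2 INTEGER,
                      islice3 INTEGER, islice4 INTEGER)""")

    def insert(h):
        db.execute("INSERT INTO hashes VALUES (?, ?, ?, ?, ?)", [h] + slices(h))

    def query(q, max_distance=3):
        """Pre-filter on exact slice matches, then verify the true distance."""
        qs = slices(q)
        candidates = db.execute(
            """SELECT hash FROM hashes
               WHERE islice1 = ? OR islice2 = ? OR islice3 = ? OR islice4 = ?""",
            qs).fetchall()
        return [h for (h,) in candidates if hamming(q, h) <= max_distance]

    insert(0b10110011_01010101_11110000_00001111)
    print(query(0b10110011_01010101_11110000_00001011))  # 1 bit off, so it is found

In a real table you would likely index each islice column so the OR pre-filter stays cheap.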

This approach was first described, afaik, by Moses Charikar in his seminal "simhash" paper and the corresponding Google patent:

  1. APPROXIMATE NEAREST NEIGHBOR SEARCH IN HAMMING SPACE

[...]

Given bit vectors consisting of d bits each, we choose N = O(n^(1/(1+ε))) random permutations of the bits. For each random permutation σ, we maintain a sorted order O_σ of the bit vectors, in lexicographic order of the bits permuted by σ. Given a query bit vector q, we find the approximate nearest neighbor by doing the following:

For each permutation σ, we perform a binary search on O_σ to locate the two bit vectors closest to q (in the lexicographic order obtained by bits permuted by σ). We now search in each of the sorted orders O_σ, examining elements above and below the position returned by the binary search, in order of the length of the longest prefix that matches q.
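As a rough illustration of that idea (not Charikar's actual implementation), here is a Python sketch for 32-bit hashes; the number of permutations and the window width examined around each binary-search position are arbitrary assumptions:

    import bisect
    import random

    NBITS = 32

    def apply_perm(h, perm):
        """Permute the bits of h: bit perm[i] of h becomes bit i of the result."""
        out = 0
        for i, src in enumerate(perm):
            out |= ((h >> src) & 1) << i
        return out

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def build_index(hashes, n_perms=8, seed=0):
        """Build n_perms sorted orders, each keyed by a random bit permutation."""
        rng = random.Random(seed)
        index = []
        for _ in range(n_perms):
            perm = list(range(NBITS))
            rng.shuffle(perm)
            order = sorted((apply_perm(h, perm), h) for h in hashes)
            index.append((perm, order))
        return index

    def nearest(index, q, width=16):
        """Binary-search each order near q's permuted value; keep the best match."""
        best = None
        for perm, order in index:
            pos = bisect.bisect_left(order, (apply_perm(q, perm), -1))
            lo, hi = max(0, pos - width), min(len(order), pos + width)
            for _, h in order[lo:hi]:
                d = hamming(q, h)
                if best is None or d < best[0]:
                    best = (d, h)
        return best  # (distance, hash), or None if the index is empty

    idx = build_index([0x12345678, 0x12345679, 0xFEDCBA98])
    print(nearest(idx, 0x1234567B))  # distance 1 to 0x12345679

The intuition is that two hashes sharing a long prefix under some random bit permutation are likely close in Hamming distance, so scanning a small window around the binary-search position tends to surface near neighbors.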

Monika Henzinger expanded on this in her paper "Finding near-duplicate web pages: a large-scale evaluation of algorithms":

3.3 The Results for Algorithm C

We partitioned the bit string of each page into 12 non-overlapping 4-byte pieces, creating 20B pieces, and computed the C-similarity of all pages that had at least one piece in common. This approach is guaranteed to find all pairs of pages with difference up to 11, i.e., C-similarity 373, but might miss some for larger differences.

This is also explained in the paper Detecting Near-Duplicates for Web Crawling by Gurmeet Singh Manku, Arvind Jain, and Anish Das Sarma:

  1. THE HAMMING DISTANCE PROBLEM

Definition: Given a collection of f-bit fingerprints and a query fingerprint F, identify whether an existing fingerprint differs from F in at most k bits. (In the batch-mode version of the above problem, we have a set of query fingerprints instead of a single query fingerprint.)

[...]

Intuition: Consider a sorted table of 2^d f-bit truly random fingerprints. Focus on just the most significant d bits in the table. A listing of these d-bit numbers amounts to "almost a counter" in the sense that (a) quite a few 2^d bit-combinations exist, and (b) very few d-bit combinations are duplicated. On the other hand, the least significant f − d bits are "almost random".

Now choose d′ such that |d′ − d| is a small integer. Since the table is sorted, a single probe suffices to identify all fingerprints which match F in d′ most significant bit-positions. Since |d′ − d| is small, the number of such matches is also expected to be small. For each matching fingerprint, we can easily figure out if it differs from F in at most k bit-positions or not (these differences would naturally be restricted to the f − d′ least-significant bit-positions).

The procedure described above helps us locate an existing fingerprint that differs from F in k bit-positions, all of which are restricted to be among the least significant f − d′ bits of F. This takes care of a fair number of cases. To cover all the cases, it suffices to build a small number of additional sorted tables, as formally outlined in the next section.
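As a toy illustration of a single such probe (not the paper's full multi-table scheme), here is a Python sketch; the fingerprint width f = 32 and the d_prime and k values in the example are illustrative assumptions:

    import bisect

    F = 32  # fingerprint width in bits (the paper's f)

    def hamming(a, b):
        return bin(a ^ b).count("1")

    def probe(sorted_fps, query, d_prime, k):
        """Return fingerprints that match query in the d_prime most significant
        bits and differ from it in at most k bits overall."""
        shift = F - d_prime
        prefix = query >> shift
        lo = bisect.bisect_left(sorted_fps, prefix << shift)
        hi = bisect.bisect_left(sorted_fps, (prefix + 1) << shift)
        return [fp for fp in sorted_fps[lo:hi] if hamming(fp, query) <= k]

    fps = sorted([0x0F0F0F00, 0x0F0F0F0F, 0xF0F0F0F0])
    print(probe(fps, 0x0F0F0F0B, d_prime=12, k=4))  # finds both 0x0F0F0F.. entries

As the excerpt notes, one table only catches differences confined to the low f − d′ bits; the paper covers the remaining cases with a few additional tables built over permuted bits.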

PS: Most of these fine brains are or were associated with Google in some capacity at some point, FWIW.

answered Sep 21 '22 by Philippe Ombredanne