Algorithm for finding similar images using an index

Tags:

There are some surprisingly good image compare tools which find similar image even if it's not exactly the same (eg. change in size, wallpaper, brightness/contrast). I have some example applications here:

Unique Filer 1.4 (shareware): https://web.archive.org/web/20010309014927/http://uniquefiler.com/
Fast Duplicate File Finder (Freeware): http://www.mindgems.com/products/Fast-Duplicate-File-Finder/Fast-Duplicate-File-Finder-About.htm
Visual similarity duplicate image finder (payware): http://www.mindgems.com/products/VS-Duplicate-Image-Finder/VSDIF-About.htm
Duplicate Checker (payware): http://www.duplicatechecker.com/

I only tried the first one, but all of them are developed for Windows and are not open source. Unique Filer was released in 2000 and the homepage seems to have disappeared. It was surprisingly fast (even on computers from that year) because it used an index and comparing some 10000 images using the index needed only some few seconds (and updating the index was a scalable process).

Since this algorithm in a very effective form already exists for at least 15 years, I assume it is well-documented and possibly already implemented as an open source library. Does anyone knows more about which algorithm or image detection theory was used to implement this applications? Maybe there is even a open source implementation of it available?

I already checked the question Algorithm for finding similar images but all of it's answers solve the problem by comparing one image to another. For 1000+ images this will result in 1000^2 comparing operations which is just not what I'm looking for.

379

asked Aug 03 '14 14:08

Daniel Alder

2 Answers

The problem you are describing is more generally called Nearest Neighbor Search. Since you are asking for high efficiency on large datasets, Approximated Nearest Neighbor Search is what you are after.

An efficient technique for this is Locality-Sensitive Hashing (LSH), for which these slides give a great overview. Its basic idea is the use of hashing functions which project all data to a low-dimensional space, with the constraint that the hash of similar data collides with a high probability and dissimilar data collides with low probability. These probabilities are parameters to the algorithm, with which the trade-off between accuracy and efficiency can be changed.

LSHKIT is an open-source implementation of LSH.

126

answered Nov 04 '22 06:11

Florian von Stosch

Meanwhile, I analyzed the algorithm of UniqueFiler:

size reduction

First, it reduces all images to 10x10 pixel grayscale images (likely without using interpolation)

rotation

Probably based on the brightness of the 4 quadrants, some rotation is done (this step is dangerous because it sometimes 'overlooks' similarities if images are too symmetric)

range reduction

The image brightness range is fully extended (brightest -> white, darkest -> black) and then reduced to 2 bit (4 values) per pixel

database

The values get stored as arrays of 100 bytes per image (plus file metadata)

comparison

... is done one-by-one (two nested loops over the whole database plus a third for the 100 bytes). Today, we would probably index the sorted sums of all 4 quadrants for a fast pre-selection of similar candidates.

matcher

The comparison is done byte-by-byte by difference between each two bytes, weighted but less than the square. The sum of these 100 results is the final difference between two images.

I have more detailed information a home. If I find the time, I will add them to this answer. I found this after I discovered that the database format is actually a gzipped file without header, containing fixed-sized records per image

answered Nov 04 '22 06:11

Daniel Alder

Related questions
                            
                                Optimize: Divide an array into continuous subsequences of length no greater than k such that sum of maximum value of each subsequence is minimum
                            
                                find the greatest sum of continuous subset, with different conditions
                            
                                Build stateful chain for different events and assign global ID in spark
                            
                                Given a node, how long will it take to burn the whole binary tree?
                            
                                Maximize minimum distance between arrays
                            
                                B-Tree - Why can't there be a node with an even number of keys?
                            
                                Algorithm to find optimal groups
                            
                                Uses of self referencing lists
                            
                                Product Naming Algorithm
                            
                                Josephus for large n (Facebook Hacker Cup)
                            
                                Lossless hierarchical run length encoding
                            
                                Word search algorithm
                            
                                hidden markov model thresholding
                            
                                Image comparison algorithm that ignores brightness
                            
                                How can I adapt the Levenshtein Distance algorithm to limit matches to a single word?
                            
                                Fuzzy Bit Matching
                            
                                Removing "almost duplicate" strings in subquadratic time
                            
                                Using circular permutations to reduce Traveling Salesman complexity
                            
                                String Matching: Computing the longest prefix suffix array in kmp algorithm
                            
                                Changing data structure representation at runtime: looking for other examples

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Algorithm for finding similar images using an index

Tags:

algorithm

image-processing

similarity

Daniel Alder

People also ask

2 Answers

Florian von Stosch

size reduction

rotation

range reduction

database

comparison

matcher

Daniel Alder

Recent Activity

Donate For Us