What's a fast way to sort a given set of images by their similarity to each other?
At the moment I have a system that does histogram analysis between two images, but this is a very expensive operation and seems like overkill.
Ideally I am looking for an algorithm that would give each image a score (for example an integer score, such as the RGB average) so that I can just sort by that score. Identical scores, or scores next to each other, are possible duplicates.
0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994
RGB average per image sucks. Is there something similar?
Visual data contains many near-duplicate images, usually defined as images derived from the same digital source by various copy attacks, or images captured from the same scene by different cameras and/or under different conditions.
Near-duplicate detection is the task of identifying and organizing documents that are "nearly identical" to each other. In other words, near-duplicates originate from the same reference copy.
There has been a lot of research on image searching and similarity measures. It's not an easy problem. In general, a single int won't be enough to determine if images are very similar. You'll have a high false-positive rate.
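To make the false-positive problem concrete, here is a small sketch. It uses toy grayscale images stored as nested lists, and the names `mean_score`, `checkerboard` and `flat_gray` are made up for illustration. Two images that look nothing alike end up with the same single-integer score:

```python
def mean_score(pixels):
    """Collapse a grayscale image (rows of 0-255 ints) into one integer."""
    flat = [p for row in pixels for p in row]
    return sum(flat) // len(flat)

# Toy 4x4 images, purely illustrative.
# A high-contrast checkerboard of 0 and 254 ...
checkerboard = [[0 if (x + y) % 2 == 0 else 254 for x in range(4)]
                for y in range(4)]
# ... and a completely flat mid-gray image.
flat_gray = [[127] * 4 for _ in range(4)]

# Both images get the score 127, so sorting by it would flag them as dupes.
print(mean_score(checkerboard), mean_score(flat_gray))
```

Any score that throws away spatial layout has this weakness, which is why fingerprints usually keep at least a coarse picture of where the light and dark regions are.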
However, since there has been a lot of research done, you might take a look at some of it. For example, this paper (PDF) gives a compact image fingerprinting algorithm that is suitable for finding duplicate images quickly and without storing much data. It seems like this is the right approach if you want something robust.
If you're looking for something simpler, but definitely more ad-hoc, this SO question has a few decent ideas.
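As one illustration of the simpler, ad-hoc end of the spectrum, here is a minimal sketch of an "average hash". This is not the algorithm from the paper above, and I'm not claiming it is what the linked question proposes; it is just a commonly used idea. It assumes you already have grayscale pixel data as rows of 0-255 ints (loading and converting the image is left to whatever image library you use), and the function names are hypothetical:

```python
def average_hash(pixels, size=8):
    """Return a (size*size)-bit fingerprint of a grayscale image.

    pixels: list of rows of 0-255 ints, at least size x size (assumption).
    """
    h, w = len(pixels), len(pixels[0])
    cells = []
    # Shrink the image to size x size by averaging rectangular blocks.
    for by in range(size):
        y0, y1 = by * h // size, (by + 1) * h // size
        for bx in range(size):
            x0, x1 = bx * w // size, (bx + 1) * w // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    # One bit per cell: is it brighter than the overall mean?
    mean = sum(cells) / len(cells)
    bits = 0
    for c in cells:
        bits = (bits << 1) | int(c > mean)
    return bits


def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


# Toy 16x16 test images: a horizontal gradient, a slightly brighter copy,
# and an inverted copy.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, p + 10) for p in row] for row in gradient]
inverted = [[255 - p for p in row] for row in gradient]

print(hamming(average_hash(gradient), average_hash(brighter)))  # 0
print(hamming(average_hash(gradient), average_hash(inverted)))  # 64
```

Note that this gives you a fingerprint to compare by Hamming distance rather than a score you can sort on: two hashes that are numerically far apart can still differ in only one bit. A small distance suggests a possible duplicate, and the threshold is something you would have to tune on your own images.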