Near-Duplicate Image Detection [closed]

What's a fast way to sort a given set of images by their similarity to each other?

At the moment I have a system that does histogram analysis between two images, but this is a very expensive operation and seems like overkill.

Ideally, I am looking for an algorithm that would give each image a score (for example an integer score, such as the RGB average) so that I can just sort by that score. Identical scores, or scores next to each other, would indicate possible duplicates.

0299393
0599483
0499994 <- possible dupe
0499999 <- possible dupe
1002039
4995994
6004994

The per-image RGB average works poorly for this. Is there something similar that works better?

The Unknown asked Jun 23 '09 20:06




1 Answer

There has been a lot of research on image searching and similarity measures. It's not an easy problem. In general, a single int won't be enough to determine if images are very similar. You'll have a high false-positive rate.
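To make the false-positive problem concrete, here is a minimal sketch in pure Python. It assumes images are given as 2D lists of grayscale values (0-255), and `mean_intensity` is a hypothetical helper standing in for any single-number score such as the RGB average. Two images that look nothing alike end up with the same score:

```python
def mean_intensity(pixels):
    """Return the integer mean of a 2D list of grayscale values (0-255)."""
    flat = [value for row in pixels for value in row]
    return sum(flat) // len(flat)


# Two visibly different 2x2 images: a hard checkerboard and a flat grey.
checkerboard = [[0, 255], [255, 0]]
flat_grey = [[127, 128], [128, 127]]

score_a = mean_intensity(checkerboard)
score_b = mean_intensity(flat_grey)

# Both scores come out as 127, so sorting by score would place these
# two unrelated images next to each other as "possible dupes".
print(score_a, score_b)
```

Any score that collapses a whole image into one number discards the spatial layout, which is why collisions like this are expected rather than rare.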

However, since there has been a lot of research done, you might take a look at some of it. For example, this paper (PDF) gives a compact image fingerprinting algorithm that is suitable for finding duplicate images quickly and without storing much data. It seems like this is the right approach if you want something robust.
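As a rough illustration of what a compact fingerprint can look like, here is a sketch of the widely used "average hash" idea. This is not the algorithm from the paper mentioned above, only a simple stand-in. It assumes a grayscale image supplied as a 2D list of ints at least 8 pixels wide and tall; the function names are made up for this example. The image is shrunk to an 8x8 grid of block averages, each cell is compared with the grid mean, and the resulting 64 bits are packed into one integer. Fingerprints are then compared by Hamming distance rather than by sorting:

```python
def average_hash(pixels, size=8):
    """Return a size*size-bit fingerprint of a 2D grayscale image.

    `pixels` is a list of rows of ints (0-255); its height and width
    are assumed to be at least `size`.
    """
    height, width = len(pixels), len(pixels[0])
    cells = []
    for cy in range(size):
        y0, y1 = cy * height // size, (cy + 1) * height // size
        for cx in range(size):
            x0, x1 = cx * width // size, (cx + 1) * width // size
            block = [pixels[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            cells.append(sum(block) / len(block))
    mean = sum(cells) / len(cells)
    bits = 0
    for value in cells:
        # One bit per cell: 1 if the cell is brighter than the grid mean.
        bits = (bits << 1) | (1 if value > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


# A 16x16 horizontal gradient, a slightly brighter copy, and its inverse.
gradient = [[x * 16 for x in range(16)] for _ in range(16)]
brighter = [[min(255, v + 10) for v in row] for row in gradient]
inverted = [[255 - v for v in row] for row in gradient]

near_dupe_distance = hamming_distance(average_hash(gradient), average_hash(brighter))
different_distance = hamming_distance(average_hash(gradient), average_hash(inverted))
print(near_dupe_distance, different_distance)  # prints "0 64"
```

A small distance (a threshold of a few bits is a common starting point, but it needs tuning on real data) suggests a near-duplicate. Note that this kind of hash tolerates brightness changes and mild rescaling, but not cropping or rotation.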

If you're looking for something simpler, but definitely more ad-hoc, this SO question has a few decent ideas.

Naaff answered Oct 29 '22 16:10
