Identify images with same content in Java

A while ago, I spent some time searching for ways to determine whether two images are identical in order to answer this question. I now face a slightly different problem: I have roughly two thousand images at hand, some of which have the same content, but are scaled/rotated versions of each other (rotations are always by multiples of 90°), along with the problem of different compressions and image formats (mostly jpg, some png, nothing else). The scaling doesn't go beyond roughly 2:1. What I'd like to do is eliminate duplicates while retaining the instance of highest quality. Since Java is the only language in which I'm fairly proficient, I need to use Java.

The answers to a different question offer many useful links, but it doesn't look like any among them can identify duplicates when scaled/rotated.

This question, along with its answers, suggests first scaling all images to a very small size (say 32*32 or 16*16), then hashing the result and comparing images by hash. This sounds sensible to me: the hashes could be sorted before comparison, after which finding equal hashes is a single O(n) pass. However, given that the images may be rotated, I'm not sure how to deal with that; one option would be to go through all the images manually and decide on a rotation, since what they depict has a clear orientation (the human eye can very easily decide which way "up" should be). If possible, I'd like to avoid that though.
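One way the rotation issue might be handled without manual work is to hash each image in all four 90° orientations and compare against every orientation. The following is only a minimal sketch under stated assumptions: the class `RotationHash`, the 8*8 grid and the plain "average hash" are illustrative choices, not something taken from the linked answers. It averages grey levels per grid cell, so it also ignores aspect ratio, and any threshold on the resulting distance would have to be tuned on real data.

```java
import java.awt.image.BufferedImage;

// Hypothetical helper, not from any library: average hash in four orientations.
final class RotationHash {
    static final int N = 8; // N x N grid, so each hash fits in 64 bits

    /** Average grey level (0..255) of each grid cell, row-major. */
    static int[] greyGrid(BufferedImage img) {
        int w = img.getWidth(), h = img.getHeight();
        long[] sum = new long[N * N];
        int[] count = new int[N * N];
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                int rgb = img.getRGB(x, y);
                int grey = (((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF)) / 3;
                int cell = (y * N / h) * N + (x * N / w);
                sum[cell] += grey;
                count[cell]++;
            }
        }
        int[] grid = new int[N * N];
        for (int i = 0; i < grid.length; i++) {
            grid[i] = count[i] == 0 ? 0 : (int) (sum[i] / count[i]);
        }
        return grid;
    }

    /** Rotate an N x N grid by 90 degrees clockwise. */
    static int[] rotate90(int[] grid) {
        int[] out = new int[N * N];
        for (int y = 0; y < N; y++) {
            for (int x = 0; x < N; x++) {
                out[x * N + (N - 1 - y)] = grid[y * N + x];
            }
        }
        return out;
    }

    /** Average hash: bit i is set when cell i is brighter than the mean. */
    static long averageHash(int[] grid) {
        long sum = 0;
        for (int v : grid) sum += v;
        long hash = 0;
        for (int i = 0; i < grid.length; i++) {
            if ((long) grid[i] * grid.length > sum) hash |= 1L << i;
        }
        return hash;
    }

    /** Hashes of the image at 0, 90, 180 and 270 degrees. */
    static long[] hashes(BufferedImage img) {
        long[] out = new long[4];
        int[] grid = greyGrid(img);
        for (int k = 0; k < 4; k++) {
            out[k] = averageHash(grid);
            grid = rotate90(grid);
        }
        return out;
    }

    /** Smallest Hamming distance between a's upright hash and any rotation of b. */
    static int distance(long[] a, long[] b) {
        int best = Long.SIZE;
        for (long h : b) best = Math.min(best, Long.bitCount(a[0] ^ h));
        return best;
    }
}
```

Images could be loaded with `javax.imageio.ImageIO.read(file)` and two of them treated as duplicate candidates when `RotationHash.distance(RotationHash.hashes(a), RotationHash.hashes(b))` is small. Because near-duplicates differ in a few bits, this trades the sort-based O(n) idea for pairwise comparison, which for roughly two thousand images is only about two million cheap comparisons.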

Are there established methods/algorithms (the links mention SSIM) to deal with this kind of problem, or can any of you come up with better ways than described above? Maybe someone knows Java libraries that would be well suited to the task (the linked questions mention a Java wrapper for OpenCV, as well as ImageJ and imgscalr)? Any help is appreciated.

G. Bach asked Mar 05 '13 19:03



1 Answer

I think that the general answer to this question calls for an unsupervised machine learning approach that generates local invariant features (basically, a fancy way of finding hashes that don't change with scaling or rotation) and then runs a clustering algorithm over them. Here are some papers that might be relevant:

  • Clustering Near-Duplicate Images in Large Collections
  • A Novel Duplicate Images Detection Method Based on PLSA Model
  • Efficient image duplicate detection based on image analysis - Tons of stuff in here, since it's some dude's entire PhD thesis
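As a rough illustration of the clustering step only, here is a minimal sketch that assumes each image has already been reduced to a 64-bit fingerprint by some scale- and rotation-tolerant method. It is not the approach from the papers above: the class `DuplicateClusters`, the Hamming-distance threshold and the simple single-linkage grouping via union-find are all assumptions chosen to keep the example short.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: groups fingerprints that are within maxDistance bits of each other.
final class DuplicateClusters {

    /** Find the representative of i's group, shortening the path as we go. */
    static int find(int[] parent, int i) {
        while (parent[i] != i) {
            parent[i] = parent[parent[i]];
            i = parent[i];
        }
        return i;
    }

    /** Returns groups of indices; images in one group are duplicate candidates. */
    static List<List<Integer>> cluster(long[] fingerprints, int maxDistance) {
        int n = fingerprints.length;
        int[] parent = new int[n];
        for (int i = 0; i < n; i++) parent[i] = i;

        // Pairwise comparison: O(n^2), fine for a few thousand images.
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                if (Long.bitCount(fingerprints[i] ^ fingerprints[j]) <= maxDistance) {
                    parent[find(parent, j)] = find(parent, i);
                }
            }
        }

        // Collect indices by group representative, in order of first appearance.
        Map<Integer, List<Integer>> groups = new LinkedHashMap<>();
        for (int i = 0; i < n; i++) {
            groups.computeIfAbsent(find(parent, i), k -> new ArrayList<>()).add(i);
        }
        return new ArrayList<>(groups.values());
    }
}
```

Within each returned group one could then keep, say, the image with the most pixels and discard the rest. Single linkage can chain loosely related images into one group, so the threshold should be kept small and the groups checked by eye before anything is deleted.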
Andrew Mao answered Nov 15 '22 04:11
