Comparing images to find duplicates

I have a few (38000) picture/video files in a folder. Approximately 40% of these are duplicates, which I'm trying to get rid of. My question is: how can I tell if two files are identical? So far I have tried using a SHA-1 hash of the files, but it turns out that many duplicate files have different hashes. This is the code I was using:

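import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.math.BigInteger;
import java.security.DigestInputStream;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public static String getHash(File doc) {
    try {
        MessageDigest md = MessageDigest.getInstance("SHA-1");
        // try-with-resources closes the whole stream chain, even on failure
        try (InputStream in = new DigestInputStream(
                new BufferedInputStream(new FileInputStream(doc)), md)) {
            byte[] buffer = new byte[8192];
            // Reading through the DigestInputStream feeds every byte to md
            while (in.read(buffer) != -1) {
                // nothing to do; the digest is updated as a side effect
            }
        }
        // Signum 1 keeps the value non-negative; %040x keeps leading zeros
        return String.format("%040x", new BigInteger(1, md.digest()));
    } catch (NoSuchAlgorithmException | IOException e) {
        // Fail loudly instead of returning the hash of a partially read file
        throw new IllegalStateException("Could not hash " + doc, e);
    }
}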

Can I modify this in any way? Or will I have to use a different method?

spacitron asked Jun 24 '13 18:06


1 Answer

As outlined above, duplicate detection can be based on a hash. However, if you want near-duplicate detection, meaning you are searching for images that show basically the same thing but have been scaled, rotated, etc., you might need a content-based image retrieval approach. LIRE (https://code.google.com/p/lire/) is a Java library for that, and you'll find the "SimpleApplication" in its Download section. What you can then do is:

  1. Index the first image
  2. Go to the next image, I
  3. Search for I in the index
  4. If there are results with a score below a threshold, then mark them as duplicates
  5. Index I
  6. Go to (2)
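
The loop above could be sketched roughly as follows. This is not LIRE code: as a stand-in for LIRE's index and features it assumes a plain in-memory list and a 64-bit average hash, with the Hamming distance between two hashes playing the role of the score (lower means more similar). The names NearDuplicateSketch, averageHash and findDuplicates are made up for illustration, and unlike step 4 the sketch records the new image as a duplicate of the first indexed match rather than marking every result.

```java
import java.awt.Graphics2D;
import java.awt.image.BufferedImage;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class, not part of LIRE.
class NearDuplicateSketch {

    // Stand-in feature: shrink to 8x8 grayscale and set one bit per pixel
    // whose brightness is at or above the mean brightness.
    static long averageHash(BufferedImage img) {
        BufferedImage small = new BufferedImage(8, 8, BufferedImage.TYPE_BYTE_GRAY);
        Graphics2D g = small.createGraphics();
        g.drawImage(img, 0, 0, 8, 8, null);
        g.dispose();

        int[] gray = new int[64];
        int sum = 0;
        for (int i = 0; i < 64; i++) {
            gray[i] = small.getRaster().getSample(i % 8, i / 8, 0);
            sum += gray[i];
        }
        long hash = 0L;
        for (int i = 0; i < 64; i++) {
            // gray[i] >= sum / 64, written without integer division
            if (gray[i] * 64 >= sum) {
                hash |= 1L << i;
            }
        }
        return hash;
    }

    // Steps 1-6: walk the images in iteration order, search the index for a
    // match scoring below the threshold, then add the image to the index.
    // Returns a map from each duplicate's name to the earlier image it matched.
    static Map<String, String> findDuplicates(Map<String, Long> hashes, int threshold) {
        List<Map.Entry<String, Long>> index = new ArrayList<>();
        Map<String, String> duplicates = new LinkedHashMap<>();
        for (Map.Entry<String, Long> image : hashes.entrySet()) {
            for (Map.Entry<String, Long> indexed : index) {
                // Score = number of differing bits between the two hashes
                int score = Long.bitCount(image.getValue() ^ indexed.getValue());
                if (score < threshold) {
                    duplicates.put(image.getKey(), indexed.getKey());
                    break;
                }
            }
            index.add(image); // step 5: index the image whether or not it matched
        }
        return duplicates;
    }
}
```

To use it, one would fill a LinkedHashMap with averageHash(ImageIO.read(file)) for each picture and pass it to findDuplicates. The linear scan over the index makes this quadratic in the number of images, and an average hash tolerates scaling far better than rotation, so a real index such as LIRE's is the better fit for a large or heavily edited collection.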

Students of mine did it, it worked well, but I don't have the source code at hand. But rest assured, it's just a few lines and the simple application will get you started.

Mathias answered Oct 12 '22 23:10
