
JPEG built-in checksum / fingerprint?

I'm putting together a script to find and remove duplicates in a large library of images. At the moment I'm doing a two-pass filter: first finding files of the same size, then doing a sha256 on a 10240-byte piece of each same-sized file to get a fingerprint (code here).
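For readers who can't follow the link, the two-pass filter described above might look roughly like the sketch below. It is not the linked script: the function names `partial_sha256` and `find_duplicate_candidates` are made up for illustration, and it assumes the 10240-byte piece is the start of the file.

```python
import hashlib
import os
from collections import defaultdict

CHUNK = 10240  # bytes hashed per file (assumed to be the first 10240 bytes)


def partial_sha256(path, nbytes=CHUNK):
    """Return the SHA-256 hex digest of the first `nbytes` bytes of a file."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read(nbytes)).hexdigest()


def find_duplicate_candidates(root):
    """Return lists of paths that share both file size and partial hash."""
    # Pass 1: group every file under `root` by its size in bytes.
    by_size = defaultdict(list)
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            by_size[os.path.getsize(path)].append(path)

    # Pass 2: within each size group, group again by partial hash.
    groups = []
    for paths in by_size.values():
        if len(paths) < 2:
            continue  # a unique size cannot have a duplicate
        by_hash = defaultdict(list)
        for path in paths:
            by_hash[partial_sha256(path)].append(path)
        groups.extend(sorted(g) for g in by_hash.values() if len(g) > 1)
    return groups
```

Files in the same group are only candidates: they match in size and in the hashed piece, so a full comparison is still needed before deleting anything.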

It works well, but I'm guessing there are probably checksums built into the JPEG format that I could use instead of doing the sha256.

Does anyone know if there are checksums or other components that could act as checksums / fingerprints? If so, is there an efficient way to access them?

Parand asked Oct 20 '08 04:10


3 Answers

I don't think the JPEG specification includes any kind of checksum in the way you're describing.

A JPEG can contain a thumbnail as part of its EXIF metadata, though, which could serve as a rough fingerprint. It's not a perfect indicator, since it's possible for two different images to have the same thumbnail. There's at least one documented case of a thumbnail not being replaced after the image had undergone substantial modifications, so the stale thumbnail revealed much more than the publisher had intended.

Mark Ransom answered Nov 08 '22 16:11


It's been a while since I've dug into the IJG library, but I don't think there's an easy class member or function call you can use there to check for some type of fingerprint. You could use the built-in EXIF tags if you can control the encoding of the images...

jdt141 answered Nov 08 '22 14:11


I just built a very similar script. I don't want to checksum metadata; I want to see whether the actual images are duplicates even if tags have been modified. For that, it's best not to sort by size but by the checksum itself. I use jhead to remove the metadata and then checksum the whole file (I also thought about checksumming just part of it, but I don't think that saves much time). jhead can't work through pipes and overwrites the file in place, so I just copy the file to shared memory first. I store the checksum in the ImageDescription field for faster retrieval later. This also makes it possible to check image integrity later, which is part of why I checksum the whole thing. Tip: exiv2 is MUCH faster than exiftool for reading and writing metadata when you're making decisions one file at a time.
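As a rough illustration of the "checksum the image, not the tags" idea, here is a pure-Python stand-in that leaves metadata out of the hash instead of stripping it with jhead. It is a sketch under assumptions, not what jhead does: the name `image_data_digest` is hypothetical, "metadata" is taken to mean every APPn and COM segment before the scan data, and the file is assumed to have no fill bytes between segments.

```python
import hashlib
import struct


def image_data_digest(data: bytes) -> str:
    """SHA-256 of a JPEG byte string with APPn and COM segments left out."""
    if data[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    h = hashlib.sha256()
    pos = 2  # first segment starts right after SOI
    while pos + 4 <= len(data):
        if data[pos] != 0xFF:
            raise ValueError("expected a marker at offset %d" % pos)
        marker = data[pos + 1]
        if marker == 0xDA:
            # SOS: hash the scan header and everything after it, then stop.
            h.update(data[pos:])
            return h.hexdigest()
        # The 2-byte big-endian length counts itself but not the marker.
        (length,) = struct.unpack(">H", data[pos + 2:pos + 4])
        end = pos + 2 + length
        # Assumption: APP0..APP15 (EXIF, JFIF, XMP, ...) and COM are metadata.
        is_metadata = 0xE0 <= marker <= 0xEF or marker == 0xFE
        if not is_metadata:
            h.update(data[pos:end])  # tables and frame header stay in the hash
        pos = end
    raise ValueError("no SOS marker found")
```

Calling `image_data_digest(open(path, "rb").read())` on two copies of a photo that differ only in their tags should give the same digest, without modifying either file. Any bytes appended after the end-of-image marker are still hashed, so files with trailing junk would need extra handling.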

Doug answered Nov 08 '22 16:11