I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.
In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.
How can I determine where in the file the tag is located?
The images that I am scanning are TIFF, JPG, PNG, BMP, DNG, CR2, and NEF files, along with some videos (MOV, AVI, and MPG).
It is much easier to use the Python Imaging Library (PIL, nowadays maintained as Pillow) to extract the picture data (example in IPython):

    In [1]: from PIL import Image
    In [2]: import hashlib
    In [3]: im = Image.open('foo.jpg')
    In [4]: hashlib.md5(im.tobytes()).hexdigest()
    Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'
This works on any type of image that PIL can handle. The tobytes method returns a bytes object containing only the pixel data, so metadata such as EXIF never enters the hash.
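One caveat with hashing only the raw pixel bytes: the buffer carries no dimensions, so two images with the same bytes but a different shape or mode would hash identically. Below is a minimal sketch of one way around that, mixing the mode and size into the digest. The helper name `hash_pixels` is made up for illustration and is not part of PIL; with Pillow you would pass it `im.mode`, `im.size` and `im.tobytes()`.

```python
import hashlib

def hash_pixels(mode, size, data, algorithm="md5"):
    """Hash raw pixel bytes together with the image mode and dimensions.

    hash_pixels is a hypothetical helper, not part of PIL/Pillow.
    """
    digest = hashlib.new(algorithm)
    width, height = size
    # Prefix the pixel data with a small header so that images with the
    # same bytes but a different shape or mode get different hashes.
    digest.update(f"{mode}:{width}x{height}:".encode("ascii"))
    digest.update(data)
    return digest.hexdigest()

# With Pillow (assumed to be installed) the call would look like:
#   im = Image.open('foo.jpg')
#   hash_pixels(im.mode, im.size, im.tobytes())
pixels = bytes(range(6))
print(hash_pixels("L", (2, 3), pixels))
print(hash_pixels("L", (3, 2), pixels))  # same bytes, different shape
```

Note that the result is no longer equal to a plain `hashlib.md5(im.tobytes())`, so stick to one scheme for all files you want to compare.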
BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:
    In [6]: hashlib.sha512(im.tobytes()).hexdigest()
    Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'
On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:
    #!/usr/bin/env python3
    from PIL import Image
    import hashlib
    import sys

    im = Image.open(sys.argv[1])
    print(hashlib.sha512(im.tobytes()).hexdigest(), end="")
For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
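A rough sketch of the movie case, assuming the `ffmpeg` binary is on your PATH. Instead of writing each frame to an image file and opening it with PIL, this variant asks ffmpeg to decode the first video stream to raw RGB frames on stdout and hashes that stream directly. The function names are made up for this example.

```python
import hashlib
import subprocess

def ffmpeg_rawvideo_command(video_path):
    """Build an ffmpeg command that writes decoded RGB frames to stdout."""
    return [
        "ffmpeg", "-v", "error",   # only print errors
        "-i", video_path,
        "-map", "0:v:0",           # first video stream only, no audio
        "-f", "rawvideo",          # no container, just pixel data
        "-pix_fmt", "rgb24",
        "-",                       # write to stdout
    ]

def hash_stream(stream, algorithm="sha512", chunk_size=1 << 20):
    """Hash a binary stream in chunks so large videos need little memory."""
    digest = hashlib.new(algorithm)
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()

def hash_video_frames(video_path, algorithm="sha512"):
    """Hash the decoded frames of a video (requires ffmpeg to be installed)."""
    proc = subprocess.Popen(ffmpeg_rawvideo_command(video_path),
                            stdout=subprocess.PIPE)
    try:
        result = hash_stream(proc.stdout, algorithm)
    finally:
        proc.stdout.close()
        returncode = proc.wait()
    if returncode != 0:
        raise RuntimeError("ffmpeg failed with exit code %d" % returncode)
    return result
```

Calling `hash_video_frames("clip.mov")` would then give one digest for the whole clip. Keep in mind that decoded output may differ between ffmpeg versions or decoders, so such hashes are best compared on the same setup.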
One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type starts with a capital letter). JPEG has a similar but simpler file structure.
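To make the chunk idea concrete, here is a small sketch that walks the chunks of a PNG and reports where each one sits and whether it is critical. The helper names are made up for illustration, and the demo file is built in memory so the example runs without any sample image. In PNG, EXIF data lives in an ancillary `eXIf` chunk, so it falls out automatically when only critical chunks are considered.

```python
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"

def iter_chunks(data):
    """Yield (offset, type, payload) for every chunk in PNG bytes."""
    assert data[:8] == PNG_SIGNATURE
    pos = 8
    while pos + 8 <= len(data):
        length, ctype = struct.unpack(">I4s", data[pos:pos + 8])
        yield pos, ctype, data[pos + 8:pos + 8 + length]
        pos += 12 + length  # length field + type + payload + CRC

def is_critical(ctype):
    # Bit 5 of the first byte is clear for uppercase letters,
    # which marks the chunk as critical.
    return ctype[0] & 0x20 == 0

def make_chunk(ctype, payload):
    """Build one chunk for the demo file below."""
    crc = zlib.crc32(ctype + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + ctype + payload + struct.pack(">I", crc)

# Tiny 1x1 grayscale PNG with an ancillary eXIf chunk holding dummy bytes.
ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
idat = zlib.compress(b"\x00\x00")  # filter byte + one pixel
demo = (PNG_SIGNATURE
        + make_chunk(b"IHDR", ihdr)
        + make_chunk(b"eXIf", b"dummy")
        + make_chunk(b"IDAT", idat)
        + make_chunk(b"IEND", b""))

for offset, ctype, payload in iter_chunks(demo):
    kind = "critical" if is_critical(ctype) else "ancillary"
    print(offset, ctype.decode("ascii"), len(payload), kind)
```

Feeding only the payloads of the critical chunks into a hash object gives a checksum that ignores text, time stamp and EXIF chunks.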
The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.
This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give an indication of what I mean :)
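    import hashlib
    import os
    import struct

    def png(fh):
        """Hash only the IDAT (compressed image data) chunks of a PNG."""
        md5 = hashlib.md5()
        assert fh.read(8)[1:4] == b"PNG"  # signature is \x89PNG\r\n\x1a\n
        while True:
            try:
                length, = struct.unpack(">I", fh.read(4))
            except struct.error:  # fewer than 4 bytes left: end of file
                break
            if fh.read(4) == b"IDAT":
                md5.update(fh.read(length))
                fh.read(4)  # skip the CRC
            else:
                fh.seek(length + 4, os.SEEK_CUR)  # skip payload and CRC
        print("Hash: %r" % md5.digest())

    def jpeg(fh):
        """Hash everything from the start-of-scan segment onwards."""
        md5 = hashlib.md5()
        assert fh.read(2) == b"\xff\xd8"  # SOI marker
        while True:
            marker, length = struct.unpack(">2H", fh.read(4))
            assert marker & 0xff00 == 0xff00
            if marker == 0xFFDA:  # start of scan: image data follows
                md5.update(fh.read())
                break
            else:
                fh.seek(length - 2, os.SEEK_CUR)  # length counts its own 2 bytes
        print("Hash: %r" % md5.digest())

    if __name__ == '__main__':
        # Open in binary mode; the parsers compare against bytes.
        with open("sample.png", "rb") as fh:
            png(fh)
        with open("sample.jpg", "rb") as fh:
            jpeg(fh)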
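If you really do want to know where the EXIF block sits in a JPEG, the same segment walk can report it. This is only a sketch under the same assumption as the script above, namely that every segment before the start of scan carries a length field. The function names are made up, and the demo bytes are a synthetic segment layout rather than a decodable image.

```python
import hashlib
import struct

def find_exif_segment(data):
    """Return (offset, size) of the Exif APP1 segment in JPEG bytes, or None."""
    assert data[:2] == b"\xff\xd8"  # SOI marker
    pos = 2
    while pos + 4 <= len(data):
        marker, length = struct.unpack(">HH", data[pos:pos + 4])
        if marker == 0xFFDA:  # start of scan: metadata segments are behind us
            return None
        # EXIF is stored in an APP1 segment that starts with "Exif\0\0".
        if marker == 0xFFE1 and data[pos + 4:pos + 10] == b"Exif\x00\x00":
            return pos, length + 2  # size includes the 2-byte marker
        pos += 2 + length
    return None

def hash_without_exif(data):
    """MD5 of the file bytes with the Exif segment cut out, if there is one."""
    found = find_exif_segment(data)
    if found is not None:
        offset, size = found
        data = data[:offset] + data[offset + size:]
    return hashlib.md5(data).hexdigest()

def segment(marker, payload):
    """Build one marker segment for the synthetic demo below."""
    return struct.pack(">HH", marker, len(payload) + 2) + payload

# Synthetic layout: SOI, APP0 (JFIF), APP1 (Exif), SOS, scan data, EOI.
app0 = segment(0xFFE0, b"JFIF\x00" + bytes(9))
app1 = segment(0xFFE1, b"Exif\x00\x00" + b"dummy")
tail = segment(0xFFDA, bytes(10)) + b"scan-data" + b"\xff\xd9"
with_exif = b"\xff\xd8" + app0 + app1 + tail
without_exif = b"\xff\xd8" + app0 + tail

print(find_exif_segment(with_exif))
print(hash_without_exif(with_exif) == hash_without_exif(without_exif))
```

In typical files the Exif segment sits near the beginning, right after the SOI marker or a JFIF APP0 segment, but the walk above does not rely on that.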