
Compute hash of only the core image data (excluding metadata) for an image


I'm writing a script to calculate the MD5 sum of an image excluding the EXIF tag.

In order to do this accurately, I need to know where the EXIF tag is located in the file (beginning, middle, end) so that I can exclude it.

How can I determine where in the file the tag is located?

The images that I am scanning are in the formats TIFF, JPG, PNG, BMP, DNG, CR2, and NEF, plus some videos in MOV, AVI, and MPG.

asked Apr 09 '12 14:04 by ensnare

2 Answers

It is much easier to use the Python Imaging Library (Pillow) to extract the picture data (example in IPython):

In [1]: from PIL import Image

In [2]: import hashlib

In [3]: im = Image.open('foo.jpg')

In [4]: hashlib.md5(im.tobytes()).hexdigest()
Out[4]: '171e2774b2549bbe0e18ed6dcafd04d5'

This works on any type of image that PIL can handle. The tobytes method returns the raw pixel data as bytes.
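If you want to scan a whole directory of mixed files with this approach, a minimal sketch could look like the following (the pixel_hash helper and the directory walk are purely illustrative, and it assumes a reasonably recent Pillow that provides UnidentifiedImageError):

import hashlib
import os
from PIL import Image, UnidentifiedImageError

def pixel_hash(path):
    # Return a SHA-512 hex digest of the decoded pixel data, or None if
    # Pillow cannot open the file.
    try:
        with Image.open(path) as im:
            return hashlib.sha512(im.tobytes()).hexdigest()
    except (UnidentifiedImageError, OSError):
        return None

def scan(root):
    # Walk a directory tree and print a hash for every image Pillow can decode.
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            digest = pixel_hash(path)
            if digest is not None:
                print(digest, path)

if __name__ == "__main__":
    scan(".")

Raw formats such as DNG, CR2 and NEF may not decode with a stock Pillow install; in this sketch such files are simply skipped.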

BTW, the MD5 hash is now seen as pretty weak. Better to use SHA512:

In [6]: hashlib.sha512(im.tobytes()).hexdigest()
Out[6]: '6361f4a2722f221b277f81af508c9c1d0385d293a12958e2c56a57edf03da16f4e5b715582feef3db31200db67146a4b52ec3a8c445decfc2759975a98969c34'

On my machine, calculating the MD5 checksum for a 2500x1600 JPEG takes around 0.07 seconds. Using SHA512, it takes 0.10 seconds. Complete example:

#!/usr/bin/env python3

from PIL import Image
import hashlib
import sys

im = Image.open(sys.argv[1])
print(hashlib.sha512(im.tobytes()).hexdigest(), end="")

For movies, you can extract frames from them with e.g. ffmpeg, and then process them as shown above.
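A rough sketch of that idea, assuming ffmpeg is on your PATH (the function name, the choice of hashing only the first frame, and the temporary PNG are all just illustrative):

import hashlib
import os
import subprocess
import tempfile
from PIL import Image

def video_frame_hash(video_path):
    # Decode the first video frame to a temporary PNG with ffmpeg, then hash
    # its pixel data the same way as a still image.
    with tempfile.TemporaryDirectory() as tmpdir:
        frame = os.path.join(tmpdir, "frame.png")
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path, "-frames:v", "1", frame],
            check=True, capture_output=True)
        with Image.open(frame) as im:
            return hashlib.sha512(im.tobytes()).hexdigest()

Hashing a single frame only gives a coarse fingerprint of the video; extracting several frames and feeding each one into the same hash object would be more robust.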

answered Oct 31 '22 04:10 by Roland Smith

One simple way to do it is to hash the core image data. For PNG, you could do this by hashing only the "critical chunks" (i.e. the ones whose type starts with a capital letter). JPEG has a similar but simpler file structure.
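As a rough illustration of the critical-chunk idea (png_critical_hash is just an illustrative helper, separate from the script further down):

import hashlib
import struct

def png_critical_hash(path):
    # Hash only the critical chunks (IHDR, PLTE, IDAT, IEND), i.e. the chunks
    # whose 4-byte type starts with an upper-case letter.
    h = hashlib.md5()
    with open(path, "rb") as fh:
        fh.read(8)  # skip the 8-byte PNG signature
        while True:
            header = fh.read(8)  # 4-byte length + 4-byte chunk type
            if len(header) < 8:
                break
            length, ctype = struct.unpack(">I4s", header)
            data = fh.read(length)
            fh.read(4)  # skip the CRC
            if ctype[:1].isupper():
                h.update(ctype)
                h.update(data)
    return h.hexdigest()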

The visual hash in ImageMagick decompresses the image as it hashes it. In your case, you could hash the compressed image data right away, so (if implemented correctly) it should be just as quick as hashing the raw file.

This is a small Python script illustrating the idea. It may or may not work for you, but it should at least give you an indication of what I mean :)

import struct
import os
import hashlib

def png(fh):
    # Hash only the IDAT chunks (the compressed pixel data) of a PNG.
    hash = hashlib.md5()
    assert fh.read(8)[1:4] == b"PNG"
    while True:
        try:
            length, = struct.unpack(">i", fh.read(4))
        except struct.error:
            break
        if fh.read(4) == b"IDAT":
            hash.update(fh.read(length))
            fh.read(4)  # CRC
        else:
            fh.seek(length + 4, os.SEEK_CUR)
    print("Hash: %r" % hash.digest())

def jpeg(fh):
    # Hash everything from the start-of-scan marker onwards in a JPEG,
    # skipping the metadata segments (EXIF, comments, etc.) that precede it.
    hash = hashlib.md5()
    assert fh.read(2) == b"\xff\xd8"
    while True:
        marker, length = struct.unpack(">2H", fh.read(4))
        assert marker & 0xff00 == 0xff00
        if marker == 0xFFDA:  # Start of stream
            hash.update(fh.read())
            break
        else:
            fh.seek(length - 2, os.SEEK_CUR)
    print("Hash: %r" % hash.digest())

if __name__ == '__main__':
    with open("sample.png", "rb") as f:
        png(f)
    with open("sample.jpg", "rb") as f:
        jpeg(f)
answered Oct 31 '22 05:10 by Krumelur