I'm trying to prevent duplicate images from being uploaded.
I have two JPGs. Looking at them, I can see that they are in fact identical. But for some reason they have different file sizes (one is pulled from a backup, the other is a new upload), so their MD5 checksums differ.
How can I efficiently and confidently compare two images in the same sense as a human would be able to see that they are clearly identical?
Example: http://static.peterbe.com/a.jpg and http://static.peterbe.com/b.jpg
Update
I wrote this script:
    import math, operator
    from PIL import Image

    def compare(file1, file2):
        image1 = Image.open(file1)
        image2 = Image.open(file2)
        h1 = image1.histogram()
        h2 = image2.histogram()
        rms = math.sqrt(reduce(operator.add,
                               map(lambda a,b: (a-b)**2, h1, h2))/len(h1))
        return rms

    if __name__=='__main__':
        import sys
        file1, file2 = sys.argv[1:]
        print compare(file1, file2)
Then I downloaded the two visually identical images and ran the script. Output:
58.9830484122
Can anybody tell me what a suitable cutoff should be?
Update II
The difference between a.jpg and b.jpg is that the second one has been saved with PIL:
    b=Image.open('a.jpg')
    b.save(open('b.jpg','wb'))

This apparently re-encodes the image with some very slight quality changes. I've now solved my problem by applying the same PIL save to each uploaded file, without otherwise modifying it, and it now works!
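As a rough sketch of that idea in current Python with Pillow: re-encode every upload with fixed settings, then checksum the re-encoded bytes instead of the raw file. The function name `normalized_md5` and the `quality` value are assumptions made for illustration, not part of the original solution, and re-saving a JPEG that was already re-saved is not guaranteed to reproduce the same bytes.

```python
import hashlib
import io

from PIL import Image


def normalized_md5(raw_bytes, quality=75):
    """Re-encode an uploaded image as JPEG with fixed settings and hash the result.

    `quality=75` is an arbitrary choice; what matters is that every upload
    goes through exactly the same save call.
    """
    image = Image.open(io.BytesIO(raw_bytes))
    buffer = io.BytesIO()
    # Same format and settings every time, so equal inputs produce equal bytes.
    image.convert("RGB").save(buffer, format="JPEG", quality=quality)
    return hashlib.md5(buffer.getvalue()).hexdigest()
```

The returned digest can then be stored alongside each image and checked before accepting a new upload.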
Simple dummy method: resize the larger image to match the size of the smaller one and compare. To compare two images i and j, resize the larger of them to the dimensions of the other using 3-lobed Lanczos, which is conveniently available in PIL by doing img1.resize(img2.size, Image.
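A minimal sketch of the resize-then-compare idea, assuming a recent Pillow where the Lanczos filter is exposed as `Image.LANCZOS`. The function name `resized_rms_difference` and the choice to report the worst per-channel RMS are illustrative assumptions, not something the answer specifies.

```python
from PIL import Image, ImageChops, ImageStat


def resized_rms_difference(img1, img2):
    """Return the largest per-channel RMS pixel difference between two images.

    The image with more pixels is resized to the other's dimensions first.
    """
    img1 = img1.convert("RGB")
    img2 = img2.convert("RGB")
    if img1.size != img2.size:
        # Shrink whichever image has more pixels to the size of the other.
        if img1.width * img1.height > img2.width * img2.height:
            img1 = img1.resize(img2.size, Image.LANCZOS)
        else:
            img2 = img2.resize(img1.size, Image.LANCZOS)
    # Absolute per-pixel difference, then RMS of that difference per channel.
    diff = ImageChops.difference(img1, img2)
    return max(ImageStat.Stat(diff).rms)
```

A result of 0.0 means the pixels match exactly after resizing; any threshold above that for "visually identical" still has to be picked empirically.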
There is an OSS project that uses WebDriver to take screenshots and then compares the images to see if there are any issues (http://code.google.com/p/fighting-layout-bugs/). It does this by opening each file as a stream and then comparing every bit.
You may be able to do something similar with PIL.
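One way this could look with PIL is to compare the decoded pixel data rather than the file bytes, so that differences in file metadata are ignored. This is only a sketch; `pixels_identical` is a made-up helper name, and it reports a match only when every decoded pixel is exactly equal, which lossy re-encoding will usually break.

```python
from PIL import Image


def pixels_identical(img1, img2):
    """Return True only if both images decode to exactly the same pixels."""
    # Different dimensions or colour modes can never be an exact match.
    if img1.size != img2.size or img1.mode != img2.mode:
        return False
    # tobytes() returns the raw decoded pixel data of the image.
    return img1.tobytes() == img2.tobytes()
```

For files on disk, pass in the result of `Image.open(path)` for each image.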
EDIT:
After more research I found
    h1 = Image.open("image1").histogram()
    h2 = Image.open("image2").histogram()
    rms = math.sqrt(reduce(operator.add,
                           map(lambda a,b: (a-b)**2, h1, h2))/len(h1))
on http://snipplr.com/view/757/compare-two-pil-images-in-python/ and http://effbot.org/zone/pil-comparing-images.htm
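The snippet above relies on the Python 2 built-in `reduce`. A rough Python 3 equivalent of the same histogram RMS calculation might look like the following; the function name `histogram_rms` and the conversion to RGB (so that both histograms have the same length) are assumptions added here.

```python
import math

from PIL import Image


def histogram_rms(img1, img2):
    """Root-mean-square difference between the colour histograms of two images."""
    # Converting both to RGB gives histograms of equal length (3 * 256 bins).
    h1 = img1.convert("RGB").histogram()
    h2 = img2.convert("RGB").histogram()
    squared_diffs = sum((a - b) ** 2 for a, b in zip(h1, h2))
    return math.sqrt(squared_diffs / len(h1))
```

Because a histogram only counts how often each value occurs, two images containing the same pixels in a different arrangement also score 0.0, so a low value by itself does not prove the images look alike.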