Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?
What is the easiest way to check whether two files have the same content in Python?
One thing I can do is md5 each file and compare. Is there a better way?
We can check whether two files have the same content by calculating their hash values. For example, if the hashes of file1 and file3 match while the hash of file2 differs, then file1 and file3 have the same content and file2 is different.
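As a rough sketch of the hashing approach described above: the helper names `hash_file` and `same_content_by_hash` are made up for illustration, and MD5 is used only because the question mentions it. The file is read in chunks so that large files do not have to fit in memory.

```python
import hashlib


def hash_file(path, algorithm="md5", chunk_size=65536):
    """Return the hex digest of the file at `path` (hypothetical helper)."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as f:
        # Read fixed-size chunks until read() returns an empty bytes object.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content_by_hash(path_a, path_b):
    """True if both files produce the same digest."""
    return hash_file(path_a) == hash_file(path_b)
```

With three files where file1 and file3 hold the same bytes, `same_content_by_hash("file1", "file3")` would be true and `same_content_by_hash("file1", "file2")` false (the file names here are placeholders).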
Yes, I think hashing the files would be the best way if you have to compare several files and store the hashes for later comparison. As hashes can collide, a byte-by-byte comparison may also be done, depending on the use case.
Generally, a byte-by-byte comparison would be sufficient and efficient, and the filecmp module already does that, plus other things too.
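One way this could look for several files, as a sketch only: group the paths by digest, then confirm each candidate group with a real content comparison in case of a hash collision. The names `hash_file` and `find_duplicate_groups` are assumptions made for this example, not part of any library.

```python
import filecmp
import hashlib
from collections import defaultdict


def hash_file(path, chunk_size=65536):
    """Return the MD5 hex digest of a file, read in chunks (hypothetical helper)."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicate_groups(paths):
    """Return lists of paths whose contents are identical."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[hash_file(path)].append(path)

    groups = []
    for candidates in by_digest.values():
        # Equal digests only make files *candidates*; confirm by content.
        confirmed = []
        for path in candidates:
            for group in confirmed:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                confirmed.append([path])
        groups.extend(g for g in confirmed if len(g) > 1)
    return groups
```

The digests in `by_digest` could also be written to disk and reused later, which is where hashing pays off compared with pairwise comparison.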
See http://docs.python.org/library/filecmp.html, e.g.

    >>> import filecmp
    >>> filecmp.cmp('file1.txt', 'file1.txt')
    True
    >>> filecmp.cmp('file1.txt', 'file2.txt')
    False
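One detail worth knowing about the call above: `filecmp.cmp` defaults to `shallow=True`, in which case two files whose `os.stat()` signatures (file type, size, modification time) match are treated as equal without reading their contents. If you want the contents to be compared, pass `shallow=False`. A minimal sketch, where `same_content` is just an illustrative wrapper name:

```python
import filecmp


def same_content(path_a, path_b):
    """Compare two files by content rather than by os.stat() signature."""
    # shallow=False makes filecmp read and compare the file contents
    # (it still returns False early when the sizes differ).
    return filecmp.cmp(path_a, path_b, shallow=False)
```

For example, `same_content('file1.txt', 'file2.txt')` reads both files unless their sizes already differ.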
Speed consideration: If only two files have to be compared, hashing both and comparing the hashes is usually slower than a simple byte-by-byte comparison, provided the latter is done efficiently. For example, the code below tries to time hashing against byte-by-byte comparison.
Disclaimer: this is not the best way of timing or comparing two algorithms and there is room for improvement, but it does give a rough idea. If you think it should be improved, do tell me and I will change it.

    import random
    import string
    import hashlib
    import time

    def getRandText(N):
        # Build N random printable characters; encode to bytes because hashlib needs bytes
        return "".join(random.choice(string.printable) for i in range(N)).encode("ascii")

    N = 1000000
    randText1 = getRandText(N)
    randText2 = getRandText(N)

    def cmpHash(text1, text2):
        # Hash both inputs with MD5 and compare the hex digests
        hash1 = hashlib.md5()
        hash1.update(text1)
        hash1 = hash1.hexdigest()

        hash2 = hashlib.md5()
        hash2.update(text2)
        hash2 = hash2.hexdigest()

        return hash1 == hash2

    def cmpByteByByte(text1, text2):
        # Direct comparison of the two byte strings
        return text1 == text2

    # Time 10 calls of each comparison function
    for cmpFunc in (cmpHash, cmpByteByByte):
        st = time.time()
        for i in range(10):
            cmpFunc(randText1, randText2)
        print(cmpFunc.__name__, time.time() - st)
and the output is

    cmpHash 0.234999895096
    cmpByteByByte 0.0
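As the disclaimer says, the timing can be improved. One possible refinement, sketched below under my own assumptions (the function names `cmp_hash`, `cmp_direct` and `benchmark` are made up here), uses `timeit` and also measures identical inputs. That matters because two random inputs almost certainly differ near the start, so a direct comparison can stop almost immediately, while identical inputs force it to scan everything.

```python
import hashlib
import os
import timeit


def cmp_hash(data1, data2):
    """Compare two byte strings by their MD5 digests."""
    return hashlib.md5(data1).digest() == hashlib.md5(data2).digest()


def cmp_direct(data1, data2):
    """Compare two byte strings directly."""
    return data1 == data2


def benchmark(size=1_000_000, repeats=10):
    """Time both functions on identical and on different random inputs."""
    data = os.urandom(size)
    # A separate object with equal content, so the comparison cannot
    # succeed merely because both names refer to the same object.
    identical = bytes(bytearray(data))
    different = os.urandom(size)

    results = {}
    for label, other in (("identical", identical), ("different", different)):
        for func in (cmp_hash, cmp_direct):
            elapsed = timeit.timeit(lambda: func(data, other), number=repeats)
            results[(func.__name__, label)] = elapsed
    return results


if __name__ == "__main__":
    for (name, label), seconds in benchmark().items():
        print(f"{name:10s} {label:10s} {seconds:.6f}s")
```

The absolute numbers will depend on the machine, so treat them as a rough comparison only.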