
see if two files have the same content in python [duplicate]

Tags:

python

file

Possible Duplicates:
Finding duplicate files and removing them.
In Python, is there a concise way of comparing whether the contents of two text files are the same?

What is the easiest way to see if two files are the same content-wise in Python?

One thing I can do is md5 each file and compare. Is there a better way?

Josh Gibson, asked Jul 02 '09 04:07



1 Answer

Yes, I think hashing the files is the best way if you have to compare several files and want to store the hashes for later comparison. Because hashes can collide, a byte-by-byte comparison may still be needed to confirm a match, depending on the use case.
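To make the "store hashes, then confirm byte-by-byte" idea concrete, here is a rough sketch. The function names `file_digest` and `group_identical` are made up for this illustration, and SHA-256 is used here in place of the MD5 mentioned in the question; any `hashlib` algorithm would slot in the same way.

```python
import filecmp
import hashlib
from collections import defaultdict


def file_digest(path, chunk_size=65536):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def group_identical(paths):
    """Group paths whose contents are identical.

    The digest is only a cheap index; filecmp confirms each match,
    so a hash collision cannot produce a false positive.
    """
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(path)

    groups = []
    for candidates in by_digest.values():
        # Split each digest bucket into byte-for-byte verified groups
        verified = []
        for path in candidates:
            for group in verified:
                if filecmp.cmp(group[0], path, shallow=False):
                    group.append(path)
                    break
            else:
                verified.append([path])
        groups.extend(g for g in verified if len(g) > 1)
    return groups
```

Calling `group_identical` with a list of file paths returns a list of lists, one per set of files with identical content. The digests could also be saved to disk and reused later, which is where hashing pays off over repeated pairwise comparison.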

In general, a byte-by-byte comparison is sufficient and efficient, and the filecmp module already does this, along with other things.

See http://docs.python.org/library/filecmp.html, e.g.:

    >>> import filecmp
    >>> filecmp.cmp('file1.txt', 'file1.txt')
    True
    >>> filecmp.cmp('file1.txt', 'file2.txt')
    False
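One detail worth knowing: by default `filecmp.cmp` runs with `shallow=True`, which treats two files as equal when their `os.stat()` signatures (file type, size, modification time) match, without reading the contents. The sketch below uses hypothetical scratch files, created only for this illustration, to show why `shallow=False` is the safer choice when you care about content.

```python
import filecmp
import os
import tempfile

# Hypothetical scratch files, created only for this illustration
workdir = tempfile.mkdtemp()
path_a = os.path.join(workdir, "a.txt")
path_b = os.path.join(workdir, "b.txt")

with open(path_a, "wb") as handle:
    handle.write(b"abc")
with open(path_b, "wb") as handle:
    handle.write(b"abd")  # same length, different content

# Force identical modification times so the os.stat() signatures match
os.utime(path_a, (1_000_000_000, 1_000_000_000))
os.utime(path_b, (1_000_000_000, 1_000_000_000))

# Default shallow=True: only the stat signatures are compared here
shallow_result = filecmp.cmp(path_a, path_b)

# shallow=False: the file contents are actually read and compared
deep_result = filecmp.cmp(path_a, path_b, shallow=False)

print(shallow_result, deep_result)
```

Same size plus same modification time with different content is unusual in practice, but it does happen (for example after copying with preserved timestamps), so passing `shallow=False` is a reasonable default for content checks.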

Speed consideration: if only two files have to be compared, hashing both and comparing the hashes is usually slower than a simple byte-by-byte comparison, provided the latter is done efficiently. For example, the code below tries to time hashing against byte-by-byte comparison.

Disclaimer: this is not the best way of timing or comparing two algorithms, and it needs improvement, but it does give a rough idea. If you think it should be improved, tell me and I will change it.

    import hashlib
    import random
    import string
    import time

    def getRandText(N):
        # Random printable ASCII, encoded to bytes because hashlib needs bytes
        text = "".join(random.choice(string.printable) for i in range(N))
        return text.encode("ascii")

    N = 1000000
    randText1 = getRandText(N)
    randText2 = getRandText(N)

    def cmpHash(text1, text2):
        # Hash both inputs in full, then compare the digests
        hash1 = hashlib.md5(text1).hexdigest()
        hash2 = hashlib.md5(text2).hexdigest()
        return hash1 == hash2

    def cmpByteByByte(text1, text2):
        # Direct comparison; can stop at the first differing byte
        return text1 == text2

    for cmpFunc in (cmpHash, cmpByteByByte):
        st = time.time()
        for i in range(10):
            cmpFunc(randText1, randText2)
        print(cmpFunc.__name__, time.time() - st)

and the output is

    cmpHash 0.234999895096
    cmpByteByByte 0.0
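Note that the timing above compares strings already in memory. For actual files on disk, the same advantage comes from being able to stop early: hashing has to read both files to the end, while a direct comparison can bail out on a size mismatch or at the first differing chunk. A minimal sketch of that, with `same_content` and `chunk_size` being names picked for this example rather than anything from the standard library:

```python
import os


def same_content(path_a, path_b, chunk_size=65536):
    """Return True if both files hold exactly the same bytes."""
    # Files of different sizes can never match, so nothing is read
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as file_a, open(path_b, "rb") as file_b:
        while True:
            chunk_a = file_a.read(chunk_size)
            chunk_b = file_b.read(chunk_size)
            if chunk_a != chunk_b:
                return False  # stop at the first differing chunk
            if not chunk_a:
                return True  # both files ended without a difference
```

This is roughly what `filecmp.cmp(path_a, path_b, shallow=False)` does for you already, so in real code the standard library call is probably the better choice; the sketch is only meant to show where the speed difference comes from.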
Anurag Uniyal, answered Oct 14 '22 16:10