Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MD5 and SHA-2 collisions in Python

I'm writing a simple MP3 cataloguer to keep track of which MP3's are on my various devices. I was planning on using MD5 or SHA2 keys to identify matching files even if they have been renamed/moved, etc. I'm not trying to match MP3's that are logically equivalent (i.e.: same song but encoded differently). I have about 8000 MP3's. Only about 6700 of them generated unique keys.

My problem is that I'm running into collisions regardless of the hashing algorithm I choose. In one case, I have two files that happen to be tracks #1 and #2 on the same album, they are different file sizes yet produce identical hash keys whether I use MD5, SHA2-256, SHA2-512, etc...

This is the first time I'm really using hash keys on files and this is an unexpected result. I feel something fishy is going on here from the little I know about these hashing algorithms. Could this be an issue related to MP3's or Python's implementation?

Here's the snippet of code that I'm using:

    data = open(path, 'r').read()

    m = hashlib.md5(data)

    m.update(data)

    md5String = m.hexdigest()

Any answers or insights to why this is happening would be much appreciated. Thanks in advance.

--UPDATE--:

I tried executing this code in linux (with Python 2.6) and it did not produce a collision. As demonstrated by the stat call, the files are not the same. I also downloaded WinMD5 and this did not produce a collision(8d327ef3937437e0e5abbf6485c24bb3 and 9b2c66781cbe8c1be7d6a1447994430c). Is this a bug with Python hashlib on Windows? I tried the same under Python 2.7.1 and 2.6.6 and both provide the same result.

import hashlib
import os

def createMD5( path):

    fh = open(path, 'r')
    data = fh.read()
    m = hashlib.md5(data)
    md5String = m.hexdigest()
    fh.close()
    return md5String

print os.stat(path1)
print os.stat(path2)
print createMD5(path1)
print createMD5(path2)

>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=6617216L, st_atime=1303808346L, st_mtime=1167098073L, st_ctime=1290222341L)
>>> nt.stat_result(st_mode=33206, st_ino=0L, st_dev=0, st_nlink=0, st_uid=0, st_gid=0, st_size=4921346L, st_atime=1303808348L, st_mtime=1167098076L, st_ctime=1290222341L)   
>>> a7a10146b241cddff031eb03bd572d96
>>> a7a10146b241cddff031eb03bd572d96
like image 889
Jesse Avatar asked Apr 26 '11 07:04

Jesse


2 Answers

I sort of have the feeling that you are reading a chunk of data which is smaller than the expected, and this chunk happens to be the same for both files. I don't know why, but try to open the file in binary with 'rb'. read() should read up to end of file, but windows behaves differently. From the docs

On Windows, 'b' appended to the mode opens the file in binary mode, so there are also modes like 'rb', 'wb', and 'r+b'. Python on Windows makes a distinction between text and binary files; the end-of-line characters in text files are automatically altered slightly when data is read or written. This behind-the-scenes modification to file data is fine for ASCII text files, but it’ll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files. On Unix, it doesn’t hurt to append a 'b' to the mode, so you can use it platform-independently for all binary files.

like image 122
Stefano Borini Avatar answered Oct 11 '22 14:10

Stefano Borini


The files you're having a problem with are almost certainly identical if several different hashing algorithms all return the same hash results on them, or there's a bug in your implementation.

As a sanity test write your own "hash" that just returns the file's contents wholly, and see if this one generates the same "hashes".

like image 40
Eli Bendersky Avatar answered Oct 11 '22 15:10

Eli Bendersky