I have many, many files to upload to the server, and I just want a way to avoid duplicates.
Thus, generating a unique, small key value from a big string seemed like something a checksum was intended to do, and hashing seemed like the evolution of that.
So I was going to use MD5 for this. But then I read somewhere that "MD5 hashes are not meant to be unique keys" and I thought that was really weird.
What's the right way of doing this?
Edit: by the way, I combined two sources to arrive at the following, which is how I'm currently doing it, and it's working just fine with Python 2.5:
import hashlib

def md5_from_file(fileName, block_size=2**14):
    md5 = hashlib.md5()
    f = open(fileName, 'rb')  # binary mode, so the digest is the same on every platform
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()
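For context, the dedupe part on top of this is basically just a dict from digest to file name; something along these lines (find_duplicates is only an illustrative name, not my real upload code):

import sys

def find_duplicates(file_names):
    """Group files by MD5 digest; return (unique, duplicates) lists."""
    seen = {}              # digest -> first file name seen with that content
    unique, dups = [], []
    for name in file_names:
        digest = md5_from_file(name)
        if digest in seen:
            dups.append((name, seen[digest]))
        else:
            seen[digest] = name
            unique.append(name)
    return unique, dups

if __name__ == '__main__':
    unique, dups = find_duplicates(sys.argv[1:])
    for name, original in dups:
        print '%s has the same contents as %s' % (name, original)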
Sticking with MD5 is a good idea. Just to be safe, I'd also store the file length (or the number of chunks) alongside the hash in your file-hash table.
Yes, there is the possibility that you run into two files that have the same MD5 hash, but that's quite unlikely (if your files are of a decent size). Adding the number of chunks to your hash helps reduce that further, since a collision now requires two files of the same size with the same MD5.
# This is the algorithm you described, but it also returns the number of chunks.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)

def store_file(file, nchunks, hash):
    """Tells you whether there is another file with the same contents already,
    by making a table lookup."""
    # This can be a DB lookup or some other way to obtain your hash map.
    big_table = ObtainTable()

    # A two-level lookup table (size first, then hash) might help performance;
    # that will vary with the number of entries and the nature of big_table.
    if nchunks in big_table:
        if hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is a duplicate of %s' % big_table[nchunks][hash])
    else:
        big_table[nchunks] = {}

    big_table[nchunks][hash] = file.filename
    file.save()  # or something
To reduce that possibility further, switch to SHA-1 and use the same method. Or even use both (concatenated) if performance is not an issue; see the sketch below.
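For example, a rough sketch of hashing the file with both algorithms in a single pass (same chunked-read pattern as in the question) and concatenating the hex digests:

import hashlib

def md5_sha1_from_file(fileName, block_size=2**14):
    """Hash the file with MD5 and SHA-1 in one pass and return
    the two hex digests concatenated."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    f = open(fileName, 'rb')
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
        sha1.update(data)
    f.close()
    return md5.hexdigest() + sha1.hexdigest()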
Of course, keep in mind that this will only catch duplicates at the binary level, not images, sounds, or videos that are "the same" but were saved with different encodings and so have different bytes.
The issue with hashing is that it generates a "small" identifier from a "large" dataset. It's like lossy compression. While you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.
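In practice that means treating a hash match as a candidate duplicate and, if you need certainty, confirming it with a byte-for-byte comparison; a minimal sketch:

def same_contents(path_a, path_b, block_size=2**14):
    """Byte-for-byte comparison, meant to be run only on the (rare)
    pairs of files that already have the same hash and size."""
    fa = open(path_a, 'rb')
    fb = open(path_b, 'rb')
    try:
        while True:
            a = fa.read(block_size)
            b = fb.read(block_size)
            if a != b:
                return False
            if not a:          # both files ended at the same point
                return True
    finally:
        fa.close()
        fb.close()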
Consider that MD5 yields a 128-bit value. If your input data set has 129 bits and you actually use them all, each MD5 value will appear, on average, twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.
See http://en.wikipedia.org/wiki/Birthday_Paradox
Granted, with a 128-bit hash you only reach about a 1% chance of a collision at around 2.6 * 10^18 entries, but it's better to handle the case where you do get a collision than to hope that you never will.
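That 1% figure follows from the usual birthday approximation, p ≈ 1 - exp(-n**2 / (2 * 2**128)); a quick sanity check:

import math

def collision_probability(n, bits=128):
    """Birthday-bound approximation: p ~= 1 - exp(-n**2 / (2 * 2**bits))."""
    return 1.0 - math.exp(-float(n) * n / (2.0 * 2 ** bits))

print collision_probability(2.6e18)   # prints ~0.0099, i.e. roughly 1%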