
Creating a unique key based on file content in Python

I have many, many files to be uploaded to the server, and I just want a way to avoid duplicates.

Thus, generating a small, unique key from a big string seemed like exactly what a checksum was intended for, and hashing seemed like the evolution of that.

So I was going to use an MD5 hash to do this. But then I read somewhere that "MD5 is not meant to be a unique key", and I thought that was really weird.

What's the right way of doing this?

Edit: by the way, I combined two sources to arrive at the following, which is how I'm currently doing it, and it's working just fine with Python 2.5:

import hashlib

def md5_from_file(fileName, block_size=2**14):
    """Return the hex MD5 digest of a file, read in fixed-size chunks."""
    md5 = hashlib.md5()
    # Open in binary mode so the bytes are hashed exactly as stored on disk.
    f = open(fileName, 'rb')
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
    f.close()
    return md5.hexdigest()
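
For example, here's a rough sketch of how I use it to skip duplicates (files_to_upload and upload_to_server are just placeholders for my actual upload code):

# Rough sketch: only upload files whose content hash hasn't been seen yet.
seen_hashes = set()

for fileName in files_to_upload:
    key = md5_from_file(fileName)
    if key in seen_hashes:
        continue  # same content as an earlier file, skip it
    seen_hashes.add(key)
    upload_to_server(fileName)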
asked May 04 '10 by cregox


2 Answers

Sticking with MD5 is a good idea. Just to be safe, I'd also store the file length or the number of chunks in your file-hash table.

Yes, there is the possibility of running into two files that have the same MD5 hash, but that's quite unlikely (if your files are of a decent size). Adding the number of chunks alongside the hash reduces that risk further, since a collision now requires two files of the same size with the same MD5.

# This is the algorithm you described, but it also returns the number of chunks.
new_file_hash, nchunks = hash_for_file(new_file)
store_file(new_file, nchunks, new_file_hash)

def store_file(file, nchunks, hash):
    """Tells you whether another file with the same contents already exists,
       by doing a table lookup."""
    # This can be a DB lookup or some other way to obtain your hash map.
    big_table = ObtainTable()

    # A two-level lookup table might help performance;
    # that will vary with the number of entries and the nature of big_table.
    if nchunks in big_table:
        if hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is dup with %s' % big_table[nchunks][hash])
    else:
        big_table[nchunks] = {}

    big_table[nchunks].update({
        hash: file.filename
    })

    file.save()  # or something

To reduce that possibility further, switch to SHA1 and use the same method, or even use both (concatenated) if performance is not an issue.
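
As a minimal sketch of that idea, you could reuse the chunked-read loop from the question but feed both MD5 and SHA1 in a single pass and concatenate the hex digests (the function name and chunk size here are just placeholders):

import hashlib

def combined_hash_from_file(fileName, block_size=2**14):
    """Sketch: hash the file once, updating MD5 and SHA1 together,
       and return the two hex digests concatenated into one key."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    f = open(fileName, 'rb')
    while True:
        data = f.read(block_size)
        if not data:
            break
        md5.update(data)
        sha1.update(data)
    f.close()
    return md5.hexdigest() + sha1.hexdigest()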

Of course, keep in mind that this will only catch duplicates at the binary level, not images, sounds, or videos that are "the same" but have different binary representations.

answered Sep 21 '22 by Jj.


The issue with hashing is that it generates a "small" identifier from a "large" dataset. It's like lossy compression: while you can't guarantee uniqueness, you can use it to substantially limit the number of other items you need to compare against.

Consider that MD5 yields a 128-bit value (I think that's what it is, although the exact number of bits is irrelevant). If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer datasets (e.g. "all text files of exactly 1024 printable characters") you're still going to run into collisions once you get enough inputs. Contrary to what another answer said, it is a mathematical certainty that you will get collisions.

See http://en.wikipedia.org/wiki/Birthday_Paradox

Granted, with a 128-bit hash you only reach around a 1% chance of a collision at about 2.6*10^18 entries, but it's better to handle the case where you do get a collision than to hope that you never will.
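
To show where that ~1% figure comes from, here is a quick sketch of the standard birthday-bound approximation p ≈ 1 - exp(-n^2 / 2^(b+1)); this is generic birthday math, not anything specific to MD5:

import math

def collision_probability(n, bits=128):
    """Birthday-bound approximation: p ~= 1 - exp(-n**2 / 2**(bits + 1))."""
    return 1.0 - math.exp(-float(n) * n / 2.0 ** (bits + 1))

# About a 1% chance of at least one collision among ~2.6e18 random 128-bit hashes.
print(collision_probability(2.6e18))  # ~0.0099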

answered Sep 21 '22 by dash-tom-bang