Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Fast hash function with collision possibility near SHA-1

I'm using SHA-1 to detect duplicates in a program handling files. It is not required to be cryptographic strong and may be reversible. I found this list of fast hash functions https://code.google.com/p/xxhash/

What do I choose if I want a faster function and collision on random data near to SHA-1?

Maybe a 128 bit hash is good enough for file deduplication? (vs 160 bit sha-1)

In my program the hash is calculated on chuncks from 0 - 512 KB.

like image 393
Stig Avatar asked Feb 22 '15 16:02

Stig


1 Answers

The facts:

  1. Good hash functions, specially the cryptographic ones (like SHA-1), require considerable CPU time because they have to honor a number of properties that wont be very useful for you in this case;
  2. Any hash function will give you only one certainty: if the hash values of two files are different, the files are surely different. If, however, their hash values are equal, chances are that the files are also equal, but the only way to tell for sure if this "equality" is not just a hash collision, is to fall back to a binary comparison of the two files.

The conclusion:
In your case I would try a much faster algorithm like CRC32, that has pretty much all the properties you need, and would be capable of handling more than 99.9% of the cases and only resorting to a slower comparison method (like binary comparison) to rule out the false positives. Being a lot faster in the great majority of comparisons would probably compensate for not having an "awesome" uniformity (possibly generating a few more collisions).

like image 171
ulix Avatar answered Sep 21 '22 12:09

ulix