
Is it okay to use a non-cryptographic hash to fingerprint a block of data?

Tags: file, hash

My problem is this. I have a block of data. Occasionally this block of data is updated and a new changed version appears. I need to detect if the data I am looking at matches the version I am expecting to receive.

I have decided to use a fingerprint so that I can avoid storing the 'expected' version of the data in full. It seems that the 'default' choice for this kind of thing is an MD5 hash.

However, MD5 was designed to be cryptographically secure, and there are much faster hash functions. I am looking at modern non-cryptographic functions such as CityHash and SpookyHash.

Since I control all the data in my system, I only care about accidental collisions, where a changed block of data happens to hash to the same value as the original. Therefore I don't think I have to worry about the 'attacker-proof' nature of cryptographic hashes and could get away with a simpler hash function.

Are there any problems with using a hash function such as CityHash or SpookyHash for this purpose, or should I just stick with MD5? Or should I be using something specifically designed for fingerprinting such as a Rabin fingerprint?
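To make the scenario concrete, here is a minimal Java sketch of the pattern I have in mind, using the standard library's MD5 (via MessageDigest) and CRC32 as stand-ins for the slow and fast options; which hash to plug in is exactly the question:

    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;
    import java.util.zip.CRC32;

    public class Fingerprint {
        // Cryptographic option: 128-bit MD5 digest of the block.
        static byte[] md5(byte[] block) throws NoSuchAlgorithmException {
            return MessageDigest.getInstance("MD5").digest(block);
        }

        // Non-cryptographic option: 32-bit CRC of the block.
        static long crc32(byte[] block) {
            CRC32 crc = new CRC32();
            crc.update(block);
            return crc.getValue();
        }

        public static void main(String[] args) throws NoSuchAlgorithmException {
            byte[] expected = "version 1 of the data".getBytes();
            byte[] received = "version 2 of the data".getBytes();

            // Store only the fingerprint of the expected version...
            long expectedFp = crc32(expected);

            // ...then compare fingerprints to detect a change.
            boolean changed = crc32(received) != expectedFp;
            System.out.println("changed = " + changed);
        }
    }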

Asked Nov 04 '22 by Edmund Kapusniak

2 Answers

Yes, it's okay (also take a look at the even faster CRC family of functions). However, I tend to avoid using hashes to differentiate versions of data; a serial number combined with a date/time value provides a way to determine which version is newer and to detect out-of-sync changes. Fingerprints are used more to detect corrupted files than for versioning.
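A minimal sketch of what I mean by a serial number plus timestamp (the record layout is purely illustrative, not a library API):

    import java.time.Instant;

    // Illustrative version record: a monotonically increasing serial number
    // plus a modification timestamp, instead of a content hash.
    record DataVersion(long serial, Instant modified, byte[] payload) {

        // The newer version is simply the one with the higher serial number.
        boolean isNewerThan(DataVersion other) {
            return this.serial > other.serial;
        }

        // Out-of-sync detection: the same serial with different timestamps
        // means two writers updated the same version independently.
        boolean conflictsWith(DataVersion other) {
            return this.serial == other.serial && !this.modified.equals(other.modified);
        }
    }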

If you want to compare one set of data with another, don't use hashes/fingerprints; just compare the data directly. It's faster to compare two streams than it is to hash both streams and then compare the hashes.
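For example, on Java 12+ you can compare two files byte-for-byte with the standard Files.mismatch, which returns -1 when the contents are identical; no hashing involved (the file names are placeholders):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;

    public class DirectCompare {
        public static void main(String[] args) throws IOException {
            Path a = Path.of("expected.dat");   // placeholder paths
            Path b = Path.of("received.dat");

            // Files.mismatch (Java 12+) returns -1 if the files are byte-identical,
            // otherwise the offset of the first differing byte. One pass, no hashing.
            boolean identical = Files.mismatch(a, b) == -1L;
            System.out.println("identical = " + identical);
        }
    }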

That said, a good quick way to compare lots of files is to take the hash of each file, compare the hashes, and only when two hashes match compare the raw bytes. The chance of a hash collision is indeed minimal, but it isn't impossible, and I like to be absolutely sure.
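A rough sketch of that two-step approach, using CRC32 as the cheap first pass (any of the hashes discussed above would work the same way):

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.util.*;
    import java.util.zip.CRC32;

    public class DuplicateFinder {
        static long quickHash(Path p) throws IOException {
            CRC32 crc = new CRC32();
            crc.update(Files.readAllBytes(p));   // fine for modestly sized files
            return crc.getValue();
        }

        // Step 1: bucket files by hash; step 2: confirm candidates byte-for-byte.
        static List<Path[]> findDuplicates(List<Path> files) throws IOException {
            Map<Long, List<Path>> buckets = new HashMap<>();
            for (Path p : files) {
                buckets.computeIfAbsent(quickHash(p), k -> new ArrayList<>()).add(p);
            }

            List<Path[]> duplicates = new ArrayList<>();
            for (List<Path> bucket : buckets.values()) {
                for (int i = 0; i < bucket.size(); i++) {
                    for (int j = i + 1; j < bucket.size(); j++) {
                        // A matching hash is only a candidate; verify the raw bytes.
                        if (Files.mismatch(bucket.get(i), bucket.get(j)) == -1L) {
                            duplicates.add(new Path[] { bucket.get(i), bucket.get(j) });
                        }
                    }
                }
            }
            return duplicates;
        }
    }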

Answered Nov 08 '22 by Dai


You may want to use the Rabin fingerprint, which is faster than cryptographic hashes such as MD5 and SHA-1 and has mathematically provable collision probabilities. A Java implementation can be found here. Most large-scale deduplication efforts by web-scale companies use Rabin fingerprints (for example, see Google's efforts led by Henzinger).
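For reference, the core of a Rabin fingerprint is just polynomial division over GF(2); here is a bit-at-a-time sketch. The modulus constant is an assumption chosen for illustration; a real implementation must use a polynomial verified to be irreducible:

    public class RabinFingerprint {
        // Degree-64 modulus polynomial with the x^64 term implicit.
        // NOTE: this constant is illustrative only; a production implementation
        // should verify that the chosen polynomial is irreducible over GF(2).
        private static final long POLY = 0xbfe6b8a5bf378d83L;

        // Interpret the data as a polynomial over GF(2) and reduce it
        // modulo POLY, one bit at a time.
        static long fingerprint(byte[] data) {
            long f = 0L;
            for (byte b : data) {
                for (int i = 7; i >= 0; i--) {
                    long overflow = f >>> 63;            // bit about to be shifted out
                    f = (f << 1) | ((b >>> i) & 1L);     // shift in the next data bit
                    if (overflow != 0) f ^= POLY;        // reduce modulo the polynomial
                }
            }
            return f;
        }

        public static void main(String[] args) {
            System.out.printf("%016x%n", fingerprint("hello world".getBytes()));
        }
    }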

Answered Nov 08 '22 by fjxx