
How well do Non-cryptographic hashes detect errors in data vs. CRC-32 etc.?

Non-cryptographic hashes such as MurmurHash3 and xxHash are almost exclusively designed for hash tables, but they appear to function comparably (and even favorably) to CRC-32, Adler-32 and Fletcher-32. Non-crypto hashes are often faster than CRC-32 and produce more "random" output similar to slow cryptographic hashes (MD5, SHA). Despite this, I only ever see CRC-32 or MD5 recommended for data integrity/checksum purposes.

In the table below, I tested 32-bit checksum/CRC/hash functions to determine how well they detect small differences in data:

Table

Each cell shows two results: A) the number of collisions found, and B) the minimum and maximum probability (in percent) that any one of the 32 output bits is set to 1. To pass test B, the min and max should both be as close as possible to 50. Anything under 45 or over 55 indicates bias.
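For reference, here is a minimal sketch of how tests A and B could be run in Python. It is not the harness that produced the table: the input set is a hypothetical one chosen for illustration, the helper names `count_collisions` and `bit_bias` are made up, and only the standard-library `zlib.crc32` and `zlib.adler32` are measured.

```python
import zlib
from itertools import product

def count_collisions(hash_fn, inputs):
    """Count inputs whose 32-bit hash was already produced by an earlier input."""
    seen = set()
    collisions = 0
    for data in inputs:
        h = hash_fn(data) & 0xFFFFFFFF
        if h in seen:
            collisions += 1
        else:
            seen.add(h)
    return collisions

def bit_bias(hash_fn, inputs, bits=32):
    """Return (min, max) percentage of inputs for which an output bit is 1."""
    ones = [0] * bits
    total = 0
    for data in inputs:
        h = hash_fn(data) & 0xFFFFFFFF
        for b in range(bits):
            ones[b] += (h >> b) & 1
        total += 1
    pct = [100.0 * c / total for c in ones]
    return min(pct), max(pct)

# Hypothetical test set (NOT one of the tests in the table):
# every 3-byte input whose bytes are all below 32, i.e. 32768 short, similar inputs.
inputs = [bytes(t) for t in product(range(32), repeat=3)]

for name, fn in (("CRC-32", zlib.crc32), ("Adler-32", zlib.adler32)):
    lo, hi = bit_bias(fn, inputs)
    print(f"{name}: collisions={count_collisions(fn, inputs)}, "
          f"bit bias={lo:.1f}..{hi:.1f}")
```

On this particular set, CRC-32 produces no collisions, because it maps distinct equal-length inputs of up to 4 bytes to distinct values. Adler-32 does collide (for example, the inputs `01 00 01` and `00 02 00` give the same value), and its upper bits are never set on inputs this short.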


Looking at the table, MurmurHash3 and Jenkins lookup2 compare favorably to CRC-32 (which actually fails one test). They are also well-distributed. DJB2 and FNV1a pass collision tests but aren't well distributed. Fletcher32 and Adler32 struggle with the NullBytes and 8RandBytes tests.

So then my question is, compared to other checksums, how suitable are 'non-cryptographic hashes' for detecting errors or differences in files? Is there any reason a CRC-32/Adler-32/CRC-64 might outperform any decent 32-bit/64-bit hash?

bryc asked Feb 09 '18 20:02


1 Answer

Is there any reason this function would be inferior to CRC-32 or Adler-32 for detecting errors in data?

Yes, for certain kinds of error characteristics. A CRC can be designed to very effectively detect small numbers of bit errors in a packet, as you might expect on an actual communications or storage channel. That's what it's designed for.
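To make the "small numbers of bit errors" case concrete, here is a rough sketch that flips every single bit and every pair of bits in a short packet and counts how many corruptions leave the check value unchanged. The packet contents, the helper names, and the plain byte-sum used for contrast are all assumptions for illustration. Only the data bits are corrupted; the stored check value is assumed to arrive intact.

```python
import zlib
from itertools import combinations

def flip_bits(data: bytes, positions) -> bytes:
    """Return a copy of data with the given bit positions inverted."""
    buf = bytearray(data)
    for p in positions:
        buf[p // 8] ^= 1 << (p % 8)
    return bytes(buf)

def undetected_errors(check_fn, data: bytes, max_flips: int = 2) -> int:
    """Count corruptions of 1..max_flips bits that leave the check unchanged."""
    good = check_fn(data)
    nbits = len(data) * 8
    missed = 0
    for k in range(1, max_flips + 1):
        for positions in combinations(range(nbits), k):
            if check_fn(flip_bits(data, positions)) == good:
                missed += 1
    return missed

def byte_sum(data: bytes) -> int:
    """Hypothetical weak check for contrast: sum of the bytes, modulo 2**32."""
    return sum(data) & 0xFFFFFFFF

# Arbitrary 32-byte test packet (256 bits).
message = bytes(range(32))

print("CRC-32 undetected 1- and 2-bit errors:  ",
      undetected_errors(zlib.crc32, message))
print("byte sum undetected 1- and 2-bit errors:",
      undetected_errors(byte_sum, message))
```

For a packet this short, CRC-32 misses none of these corruptions. The byte sum misses some two-bit errors, namely those where one flip adds exactly what the other subtracts. A well-mixed 32-bit hash gives no such structural guarantee either: each corruption slips through with probability of roughly 2^-32, whatever its size.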

For large numbers of errors, any 32-bit check that fills the 32 bits and does a reasonably good job of being sensitive to all of the bits of the packet will work about as well as any other. So yours would be as good as a CRC-32, and a smidge better than an Adler-32. (The Adler-32 deliberately does not use all possible 32-bit values, so it has a slightly higher false positive rate than 32-bit checks that use all possible values.)
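The size of that "smidge" can be worked out directly. An Adler-32 value is two 16-bit sums, each reduced modulo 65521 (the largest prime below 2^16), so at most 65521^2 of the 2^32 possible values can ever occur. A short sketch of the arithmetic follows; the variable names are made up for illustration.

```python
import os
import zlib

MOD_ADLER = 65521  # largest prime below 2**16, the modulus of both Adler-32 sums

reachable = MOD_ADLER ** 2   # upper bound on distinct Adler-32 values
all_values = 2 ** 32         # values available to a check that uses every bit pattern

unused_fraction = 1 - reachable / all_values
print(f"reachable values: {reachable} of {all_values}")
print(f"unused fraction:  {unused_fraction:.4%}")

# Spot check on random data: both 16-bit halves always stay below the modulus.
for _ in range(1000):
    h = zlib.adler32(os.urandom(64))
    low, high = h & 0xFFFF, h >> 16
    assert low < MOD_ADLER and high < MOD_ADLER
```

So even in the best case, Adler-32 gives up a little under 0.05% of the 32-bit space, which is why the difference only shows up as a slightly higher chance of an undetected error.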

By the way, looking a little more at your algorithm, it does not distribute over all 32-bit values until you have many bytes of input. So your check would not be as good as any other 32-bit check on a large number of errors until you have covered the possible 32-bit values of the check.

Mark Adler answered Oct 19 '22 17:10