Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is any 64-bit portion of a 128-bit hash as collision-proof as a 64-bit hash?

We're trying to settle an internal debate on our dev team:

We're looking for a 64-bit PHP hash function. We found a PHP implementation of MurmurHash3, but MurmurHash3 is either 32-bit or 128-bit, not 64-bit.

Co-worker #1 believes that to produce a 64-bit hash from MurmurHash3, we can simply slice the first (or last, or any) 64 bits of the 128-bit hash and that it will be as collision-proof as a native 64-bit hash function.

Co-worker #2 believes that we must find a native 64-bit hash function to reduce collisions and that 64-bit slices of a 128-bit hash will not be as collision proof as a native 64-bit hash.

Who's correct?

Does the answer change if we take the first (or last, or any) 64-bits of a cryptographic hash like SHA1 instead of Murmur3?

like image 750
richardkmiller Avatar asked Jul 13 '12 17:07

richardkmiller


People also ask

Is 128 bit hash secure?

From the security perspective they are all bad because 128 bit is nowadays much too short. Therefore you want a cryptographically secure hash take e.g. RIPEMD-160 (which is AFAIR not as much broken as SHA-1). If don't need real security it doesn't matter if you use md5 or murmur3 or whatever 128 bit hash.

What are the 3 types of the hash collision algorithms?

Take into account the following hash algorithms – CRC-32, MD5, and SHA-1. These are common hash algorithms with varying levels of collision risk.

What is 64 bit hash?

All that is needed per hash function are three 64-bit values chosen at random, and then two multiplications, two additions and a single shift. The two multiplications are faster than they appear because they can be executed simultaneously as there is no data dependency.

Do all hash functions have collisions?

Every hash function with more inputs than outputs will necessarily have collisions. Consider a hash function such as SHA-256 that produces 256 bits of output from an arbitrarily large input.


1 Answers

If you had real random, uniformly distributed values, then "slicing" would yield exactly the same results as if you had started with the smaller value right from the start. To see why, consider this very simple example: Let's say your random generator outputs 3 random bits, but you only need one random bit to work with. Let's assume the output is

b1 b2 b3

The possible values are

000, 001, 010, 011, 100, 101, 110, 111

and all are to occur with equal probability of 1/8. Now whatever bit you slice from those three for your purpose - the first, second or third - the probability of having a '1' is always going to be 1/2, regardless of the position - and the same is true for a '0'.

You can easily scale this experiment to the 64 out of 128 bit case: regardless of which bits you slice, the probability of ending up with a one or a zero in a certain position is going to be one half. What this means is that if you had a sample taken from a uniformly distributed random variable, then slicing wouldn't make the probability for collisions more or less likely.

Now a good question is whether random functions are really the best we can do to prevent collisions. But as it turns out, it can be shown that the probability of finding collisions increases whenever a function deviates from random.

Cryptographic hash functions: co-worker #1 wins

The problem in real life is that hash functions are not random at all, on the contrary, they are boringly deterministic. But a design goal of cryptographic hash functions is as follows: if we didn't know their initial state, then their output would be computationally indistinguishable from a real random function, that is there's no computationally efficient way to tell the difference between the hash output and real random values. This is why you'd consider a hash already as kind of broken if you can find a "distinguisher", a method to tell the hash from real random values with a probability higher than one half. Unfortunately, we can't really prove these properties for existing cryptographic hashes, but unless somebody breaks them, we may assume these properties hold with some confidence. Here is an example of a paper about a distinguisher for one of the SHA-3 submissions that illustrates the process.

To summarize, unless a distinguisher is found for a given cryptographic hash, slicing is perfectly fine and does not increase the probability of a collision.

Non-cryptographic hash functions: co-worker #2 might win

Non-cryptographic hashes do not have to satisfy the same set of requirements as cryptographic hashes do. They are usually defined to be very fast and satisfy certain properties "under sane/benevolent conditions", but they might easily fall short if somebody tries to maliciously manipulate them. A good example for what this means in practice is the computational complexity attack on hash table implementations (hashDoS) presented earlier this year. Under normal conditions, non-crypto hashes work perfectly fine, but their collision resistance may be severely undermined by some clever inputs. This can't happen with cryptographic hash functions, because their very definition requires them to be immune to all sorts of clever inputs.

Because it is possible, sometimes even quite easy, to find a distinguisher like above for the output of non-cryptographic hashes, we can immediately say that they do not qualify as cryptographic hash functions. Being able to tell the difference means that somewhere there is a pattern or bias in the output.

And this fact alone implies that they deviate more or less from a random function, and thus (after what we said above) collisions are probably more likely than they would be for random functions. Finally, since collisions occur with higher probability for the full 128 bits already, this will not get better with shorter ouptputs, collisions will probably be even more likely in that case.

tl;dr You're safe with a cryptographic hash function when truncating it. But you're better off with a "native" 64 bit cryptographic hash function compared to truncating a non-cryptographic hash with a larger output to 64 bits.

like image 111
emboss Avatar answered Nov 07 '22 21:11

emboss