Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Handling close-to-impossible collisions on should-be-unique values

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.

A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.

Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states

The odds of the changeset hash colliding by accident in your first billion commits is basically zero. But we will notice if it happens. And you'll get to be famous as the guy who broke SHA1 by accident.

But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.

Rather, I'd like to know (maybe by means of examples from your own work):

  • Are there any best practices as to covering these improbable cases anyway?

  • Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?

  • Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?

  • Or should even these cases get handled gracefully?

For me, especially the following are interesting, although they are somewhat touchy-feely:

  • If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?

  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernonva?

like image 474
balpha Avatar asked Jul 08 '09 12:07

balpha


People also ask

How do you handle collisions?

One method for resolving collisions looks into the hash table and tries to find another open slot to hold the item that caused the collision. A simple way to do this is to start at the original hash value position and then move in a sequential manner through the slots until we encounter the first slot that is empty.

What can be the techniques to avoid collision?

We can avoid collision by making hash function random, chaining method and uniform hashing.

How does Python dictionary handle collisions?

Python uses a method called Open Addressing for handling collisions. It also resizes the hash tables when it reaches a certain size, but we won't discuss that aspect. Open Addressing definition from Wikipedia: In another strategy, called open addressing, all entry records are stored in the bucket array itself.

Why are collisions bad to message integrity?

Collision attacks are a significant concern in the realm of cryptography. In certain circumstances, an attacker can use them to undermine the security granted by digital signatures, allowing them to fraudulently make data appear as if it has been verified for its integrity and authenticity.


2 Answers

  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.

  • Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).

like image 154
Steve Jessop Avatar answered Oct 05 '22 13:10

Steve Jessop


Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:

1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.

The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.

p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)

where !denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:

Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png

Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.

Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).

like image 38
Aaron Maenpaa Avatar answered Oct 05 '22 12:10

Aaron Maenpaa