Handling close-to-impossible collisions on should-be-unique values

Tags:

There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (eg. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.

A hash table usually doesn't mind if two objects have the same hash because the hashing is just used to break down the objects into categories, so that on lookup, not all objects in the table, but only those objects in the same category (bucket) have to be compared for identity to the searched object.

Other implementations however (seem to) depend on the uniqueness. My example (that's what lead me to asking this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states

The odds of the changeset hash colliding by accident in your first billion commits is basically zero. But we will notice if it happens. And you'll get to be famous as the guy who broke SHA1 by accident.

But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here for example). This is very clear to me.

Rather, I'd like to know (maybe by means of examples from your own work):

Are there any best practices as to covering these improbable cases anyway?
Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?
Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?
Or should even these cases get handled gracefully?

For me, especially the following are interesting, although they are somewhat touchy-feely:

If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?
If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernonva?

474

asked Jul 08 '09 12:07

balpha

2 Answers

If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or a precondition that the GUID code relies on that you've violated (or been tricked into violating by some attacker), such as in V1 that MAC addresses are unique and time goes forward. Either is considerably more likely than supernova-based bugs.

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you countrol, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions, you're worried about deliberate collisions by malicious clients trying to DOS your server. Protecting yourself against that will probably protect you against accidents anyway.

Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, just initially direct them into buckets like a hashtable does, then use a different scheme (like a hash).

154

answered Oct 05 '22 13:10

Steve Jessop

Given a good 128 bit hash, the probably of colliding with a specific hash value given a random input is:

1 / 2 ** 128 which is approximately equal to 3 * 10 ** -39.

The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.

p = (2 ** 128)! / (2 ** (128 * n) * (2 ** 128 - n)!)

where !denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:

Probability of a random SHA-1 collision as the number of samples increases. http://img21.imageshack.us/img21/9186/sha1collision.png

Between 10**17 and 10**18 hashes we begin to see non-trivial possibilities of collision from 0.001% to 0.14% and finally 13% with 10**19 hashes. So in a system with a million, billion, records counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.

Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or someone attacking your system and so onebyone's answer provides good reasons to check for collisions even though the probability of an accidental collision are vanishingly small (that is to say the probability of bugs or malice is much higher than an accidental collision).

answered Oct 05 '22 12:10

Aaron Maenpaa

Related questions
                            
                                Remove the minimum number of blades
                            
                                Random integers in array. Find the greatest sum of a continuous subset [duplicate]
                            
                                Find Shortest Binary String In Given Interval
                            
                                OCR: segmentation of small text
                            
                                Lazy Shuffle Algorithms
                            
                                How do programming languages call code written in another language?
                            
                                Why is compare-and-swap (CAS) algorithm a good choice for lock-free synchronization?
                            
                                When should I consider changing thread priority
                            
                                How can a transform a polynomial to another coordinate system?
                            
                                Managing Code Written for learning [closed]
                            
                                Documenting getters and setters [closed]
                            
                                Finding coordinates of points on the border of a shape
                            
                                How to subsample a 2D polygon?
                            
                                Artifical intelligence that can learn [closed]
                            
                                Accessing an array element directly vs. assigning it to a variable
                            
                                unit-test the template method design pattern
                            
                                How to partition an array of integers in a way that minimizes the maximum of the sum of each partition?
                            
                                What is differentiable programming?
                            
                                What is the best way to sort a partially ordered list?
                            
                                What's a good data structure for building equivalence classes on nodes of a tree?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Handling close-to-impossible collisions on should-be-unique values

Tags:

language-agnostic

unique

balpha

People also ask

2 Answers

Steve Jessop

Aaron Maenpaa

Recent Activity

Donate For Us