Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Using hashcode for a unique ID

I am working in a java-based system where I need to set an id for certain elements in the visual display. One category of elements is Strings, so I decided to use the String.hashCode() method to get a unique identifier for these elements.

The problem I ran into, however, is that the system I am working in borks if the id is negative and String.hashCode often returns negative values. One quick solution is to just use Math.abs() around the hashcode call to guarantee a positive result. What I was wondering about this approach is what are the chances of two distinct elements having the same hashcode?

For example, if one string returns a hashcode of -10 and another string returns a hashcode of 10 an error would occur. In my system we're talking about collections of objects that aren't more than 30 elements large typically so I don't think this would really be an issue, but I am curious as to what the math says.

like image 489
IcedDante Avatar asked Jan 26 '14 20:01

IcedDante


2 Answers

Hash codes can be thought of as pseudo-random numbers. Statistically, with a positive int hash code the chance of a collision between any two elements reaches 50% when the population size is about 54K (and 77K for any int). See Birthday Problem Probability Table for collision probabilities of various hash code sizes.

Also, your idea to use Math.abs() alone is flawed: It does not always return a positive number! In 2's compliment arithmetic, the absolute value of Integer.MIN_VALUE is itself! Famously, the hash code of "polygenelubricants" is this value.

like image 106
Bohemian Avatar answered Sep 19 '22 03:09

Bohemian


Hashes are not unique, hence they are not apropriate for uniqueId.

As to probability of hash collision, you could read about birthday paradox. Actually (from what I recall) when drawing from an uniform distribution of N values, you should expect collision after drawing $\sqrt(N)$ (you could get collision much earlier). The problem is that Java's implementation of hashCode (and especially when hashing short strings) doesnt provide uniform distribution, so you'll get collision much earlier.

like image 33
jb. Avatar answered Sep 21 '22 03:09

jb.