When looking at the source code of java.lang.String of openjdk-1.6, i saw that String.hashCode() uses 31 as prime number and computes
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Now the reason for me to look at this was the question i had in mind whether comparing hashCodes in String.equals would make String.equals significantly faster. But looking at hashCode now, the following questions come to my mind:
Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
Each character in a String can take 65536 values (2^16). The set of Strings of 1 or 2 characters is therefore larger than the number of int
and any hashcode calculation methodology will produce collisions for strings that are 1 or 2 character long (which qualify as short strings I suppose).
If you restrict your character set, you can find hash function that reduce the number of collisions (see below).
Note that a good hash must also provide a good distribution of output. A comment buried in this code advocates using 33 and gives the following reasons (emphasis mine):
If one compares the chi^2 values [...] of the variants the number 33 not even has the best value. But the number 33 and a few other equally good numbers like 17, 31, 63, 127 and 129 have nevertheless a great advantage to the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus either a single addition or subtraction operation. And because a hash function has to both distribute good and has to be very fast to compute, those few numbers should be preferred.
Now these formulae were designed a while ago. Even if it appeared now that they are not ideal, it would be impossible to change the implementation because it is documented in the contract of the String class.
Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
Why does Java's hashCode() in String use 31 as a multiplier?
How likely is a hash collision, given a fixed String length?
Assuming each possible int value has the same probability of being the result of the hashcode function, the probability of collision is 1 in 2^32.
Is there a good reason String.equals does not compare hashCodes as additional shortcut?
Why does the equals method in String not use hash?
assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents?
Without any constraint on the string, there isn't. You could intern the strings then check for reference equality (==
), but if many strings are involved, that can be inefficient.
how much do we need to restrict the string space to be able to have such a hash function?
If you only allow small cap letters (26 characters), you could design a hash function that generates unique hashes for any strings of length 0 to 6 characters (inclusive) (sum(i=0..6) (26^i) = 3.10^8
).
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With