Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Is String.hashCode() inefficient? [closed]

Tags:

java

string

hash

When looking at the source code of java.lang.String of openjdk-1.6, i saw that String.hashCode() uses 31 as prime number and computes

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Now the reason for me to look at this was the question i had in mind whether comparing hashCodes in String.equals would make String.equals significantly faster. But looking at hashCode now, the following questions come to my mind:

  • Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
  • Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
  • How likely is a hash collision, given a fixed String length? where this question is heading is the original question how good comparing hashCodes and String length could already discern strings, to avoid comparing the actual contents.
  • a little off-topic, maybe: Is there a good reason String.equals does not compare hashCodes as additional shortcut?
  • a little more off-topic: assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents? I would guess not, since someway into String lengths, the space explodes into sizes where we will inevitably have collisions, but what about some restrictions - only a certain character set, a maximum string length... and how much do we need to restrict the string space to be able to have such a hash function?
like image 523
kutschkem Avatar asked Jul 17 '13 08:07

kutschkem


1 Answers

Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?

Each character in a String can take 65536 values (2^16). The set of Strings of 1 or 2 characters is therefore larger than the number of int and any hashcode calculation methodology will produce collisions for strings that are 1 or 2 character long (which qualify as short strings I suppose).

If you restrict your character set, you can find hash function that reduce the number of collisions (see below).

Note that a good hash must also provide a good distribution of output. A comment buried in this code advocates using 33 and gives the following reasons (emphasis mine):

If one compares the chi^2 values [...] of the variants the number 33 not even has the best value. But the number 33 and a few other equally good numbers like 17, 31, 63, 127 and 129 have nevertheless a great advantage to the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus either a single addition or subtraction operation. And because a hash function has to both distribute good and has to be very fast to compute, those few numbers should be preferred.

Now these formulae were designed a while ago. Even if it appeared now that they are not ideal, it would be impossible to change the implementation because it is documented in the contract of the String class.

Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?

Why does Java's hashCode() in String use 31 as a multiplier?

How likely is a hash collision, given a fixed String length?

Assuming each possible int value has the same probability of being the result of the hashcode function, the probability of collision is 1 in 2^32.

Is there a good reason String.equals does not compare hashCodes as additional shortcut?

Why does the equals method in String not use hash?

assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents?

Without any constraint on the string, there isn't. You could intern the strings then check for reference equality (==), but if many strings are involved, that can be inefficient.

how much do we need to restrict the string space to be able to have such a hash function?

If you only allow small cap letters (26 characters), you could design a hash function that generates unique hashes for any strings of length 0 to 6 characters (inclusive) (sum(i=0..6) (26^i) = 3.10^8).

like image 52
assylias Avatar answered Nov 11 '22 21:11

assylias