Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't String's hashCode() cache 0?

I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:

public class Main{    static void test(String s) {       long start = System.currentTimeMillis();       for (int i = 0; i < 10000000; i++) {          s.hashCode();       }       System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);    }    public static void main(String[] args) {       String z = "Allocator redistricts; strict allocator redistricts strictly.";       test(z);       test(z.toUpperCase());    } } 

Running this in ideone.com gives the following output:

Took 1470 ms. Took 58 ms. 

So my questions are:

  • Why doesn't String's hashCode() cache 0?
  • What is the probability that a Java string hashes to 0?
  • What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
  • Is this the best-practice way of caching values? (i.e. cache all except one?)

For your amusement, each line here is a string that hash to 0:

pollinating sandboxes amusement & hemophilias schoolworks = perversive electrolysissweeteners.net constitutionalunstableness.net grinnerslaphappier.org BLEACHINGFEMININELY.NET WWW.BUMRACEGOERS.ORG WWW.RACCOONPRUDENTIALS.NET Microcomputers: the unredeemed lollipop... Incentively, my dear, I don't tessellate a derangement. A person who never yodelled an apology, never preened vocalizing transsexuals. 
like image 259
polygenelubricants Avatar asked Feb 22 '10 11:02

polygenelubricants


People also ask

Is hashCode cached?

When the hash is calculated, it's saved to the cache.

Can hashCode of two strings be same in Java?

Two same strings/value must have the same hashcode, but the converse is not true. There might be another string which can match the same hash-code, so we can't derive the key using hash-code. The reason for two different string to have the same hash-code is due to the collision.

What is the purpose of the hashCode () method?

The purpose of the hashCode() method is to provide a numeric representation of an object's contents so as to provide an alternate mechanism to loosely identify it. By default the hashCode() returns an integer that represents the internal memory address of the object.

What is hashCode empty string?

HashCode of empty string will be zero. If a string does not contain any characters or string is empty, it will return 'h' as default value of integer which is Zero.


2 Answers

You're worrying about nothing. Here's a way to think about this issue.

Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.

And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).

In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.

Hooray! Now your app is... let's see... 0.0015% faster!

What used to take an entire day now takes only 23 hours, 57 minutes and 48 seconds!

And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.

Does this seem worth it to you?

EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!

like image 110
Kevin Bourrillion Avatar answered Oct 16 '22 09:10

Kevin Bourrillion


It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)

I don't expect many strings hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.

As for whether this is a good idea in general: it's an certainly efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.

EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.

like image 28
Jon Skeet Avatar answered Oct 16 '22 10:10

Jon Skeet