I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:
public class Main {
    static void test(String s) {
        long start = System.currentTimeMillis();
        for (int i = 0; i < 10000000; i++) {
            s.hashCode();
        }
        System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
    }

    public static void main(String[] args) {
        String z = "Allocator redistricts; strict allocator redistricts strictly.";
        test(z);
        test(z.toUpperCase());
    }
}
Running this in ideone.com gives the following output:
Took 1470 ms.
Took 58 ms.
So my questions are: why doesn't String's hashCode() cache a computed value of 0, and is this caching strategy a reasonable design choice?
For your amusement, each line below is a string that hashes to 0:
pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.
Once the hash is calculated, it's saved in the cache field, so later calls return it without recomputing (unless the computed value happened to be 0).
Two equal strings must have the same hash code, but the converse is not true: a different string may produce the same hash code, so you cannot recover the original string from its hash code. Two different strings producing the same hash code is called a collision.
The purpose of the hashCode() method is to provide a numeric representation of an object's contents, giving an alternate mechanism to loosely identify it. By default, Object.hashCode() returns an integer that is typically derived from the object's internal memory address, although the specification does not require that.
The hash code of the empty string is zero: with no characters to process, the hash loop never runs, so the local variable h keeps its default int value of 0.
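A minimal sketch of the caching logic, modeled on the JDK 6 source (the hash field and the 31-based loop follow the real implementation, but this standalone class is only an illustration):

public final class CachedHashString {
    private final char[] value; // the characters of the string
    private int hash;           // cached hash code; 0 doubles as "not computed yet"

    public CachedHashString(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            for (char c : value) {
                h = 31 * h + c;   // the polynomial documented for String.hashCode()
            }
            // For the empty string the loop never runs, so h stays 0.
            // If the computed value is 0, storing it is indistinguishable from
            // "never computed", so it gets recomputed on every call.
            hash = h;
        }
        return h;
    }
}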
You're worrying about nothing. Here's a way to think about this issue.
Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.
And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).
In this situation, you stand to benefit more than anyone else alive from Oracle's engineers improving how these strings' hash codes are cached. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.
Hooray! Now your app is... let's see... 0.0015% faster! (Only about one call in 2^16 lands on a zero-hash string, and only those calls got any cheaper, so at best roughly 1/65536 ≈ 0.0015% of the total time disappears.)
What used to take an entire day now takes only about 23 hours, 59 minutes and 58.7 seconds!
And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.
Does this seem worth it to you?
EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!
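A minimal sketch of that kind of search, assuming a plain-text word list (the path below is just a placeholder, not the actual setup used, and a full run over ~2^35 pairs takes a very long time):

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class ZeroHashPhrases {
    public static void main(String[] args) throws IOException {
        // Placeholder word list; substitute whatever dictionary you have.
        List<String> words = Files.readAllLines(Paths.get("/usr/share/dict/words"));
        for (String first : words) {
            for (String second : words) {
                String phrase = first + " " + second;
                if (phrase.hashCode() == 0) {
                    System.out.println(phrase);
                }
            }
        }
    }
}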
It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)
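For illustration only (this is not what the JDK does), the flag-based alternative would look something like this sketch, paying for an extra boolean field on every instance:

public final class FlagCachedHashString {
    private final char[] value;
    private int hash;
    private boolean hashComputed;   // the extra memory referred to above

    public FlagCachedHashString(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        if (!hashComputed) {
            int h = 0;
            for (char c : value) {
                h = 31 * h + c;
            }
            hash = h;               // a computed 0 is now cached like any other value
            hashComputed = true;
        }
        return hash;
    }
}

A real version would also have to consider how the two fields are published to other threads, a problem the single-int sentinel conveniently sidesteps.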
I don't expect many strings to hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions slightly but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.
As for whether this is a good idea in general: it's certainly an efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.
EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.
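As a sketch only (again, not what String actually does, since its hash algorithm is specified), the "avoid 0" idea would amount to one extra branch in the computation path:

public final class NoZeroCachedHashString {
    private final char[] value;
    private int hash;   // 0 still means "not computed yet", and is never a real result

    public NoZeroCachedHashString(String s) {
        this.value = s.toCharArray();
    }

    @Override
    public int hashCode() {
        int h = hash;
        if (h == 0) {
            for (char c : value) {
                h = 31 * h + c;
            }
            if (h == 0) {
                h = 1;  // remap: a few more collisions, but the result is always cacheable
            }
            hash = h;
        }
        return h;
    }
}

The extra comparison runs on every hash calculation, which is exactly the net cost the edit above describes.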