Can't understand Poisson part of Hash tables from Sun documentation

Tags:

I am trying to understand how HashMap is implemented in Java. I decided that I will try to understand every line (of code and comments) from that class and obviously I faced resistance very soon. The following snippet is from HashMap class and talks about Poisson Distribution:

 Ideally, under random hashCodes, the frequency of
 nodes in bins follows a Poisson distribution
 (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 parameter of about 0.5 on average for the default resizing
 threshold of 0.75, although with a large variance because of
 resizing granularity. Ignoring variance, the expected
 occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 factorial(k)). The first values are:
 0:    0.60653066
 1:    0.30326533
 2:    0.07581633
 3:    0.01263606
 4:    0.00157952
 5:    0.00015795
 6:    0.00001316
 7:    0.00000094
 8:    0.00000006
 more: less than 1 in ten million

I am an average guy in Math and had to understand what Poisson distribution is first. Thanks to the simple video that explained it to me.

Now even after understanding how you calculate probability using Poisson I can't understand what is described above.

Can someone please explain this in simpler language and with an example if possible? It will make my task much more interesting.

405

asked Dec 08 '13 00:12

Asif

1 Answers

A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.

To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:

bucket | elements
-------+---------
     0 | Z
     1 | X
     2 |
     3 | Y

so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:

bucket | elements
-------+---------
     0 | 
     1 | Z -> X -> Y
     2 |
     3 |

This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.

This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.

One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.

The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.

In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.

In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.

The link to the source file with the comment in question is here:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java

181

answered Oct 13 '22 05:10

Stuart Marks

Related questions
                            
                                Why does compiling this code cause a compiler stack overflow?
                            
                                What is the best Open Source Web Crawler Tool written in Java? [closed]
                            
                                How to create excel sheet with right-to-left alignment using JXL
                            
                                Google App Engine and CORS
                            
                                IntelliJ IDEA GUI builder – no Java code generated
                            
                                How to import java.nio.file package
                            
                                Get rid of the "UNKNOWN" publisher from applet security warning
                            
                                String length in pixels in Java
                            
                                Architectural Constraints In Java [closed]
                            
                                JVM HOT Swapping
                            
                                Guava: How to create an explicit Ordering from a List and a single element?
                            
                                About reference to object before object's constructor is finished
                            
                                Why use random numbers for serial version UID? [duplicate]
                            
                                Encoding/decoding of data between PHP/Java for Android
                            
                                Why do I need to explicitly cast a generic call?
                            
                                How to execute custom SQL query with spring-managed transactional EntityManager
                            
                                Invoke private method with java.lang.invoke.MethodHandle
                            
                                Java get generic type of collection [duplicate]
                            
                                Socket closed exception [duplicate]
                            
                                Java serialization and duplicate objects

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

Can't understand Poisson part of Hash tables from Sun documentation

Tags:

java

hashmap

poisson

Asif

People also ask

1 Answers

Stuart Marks

Recent Activity

Donate For Us