Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can't understand Poisson part of Hash tables from Sun documentation

I am trying to understand how HashMap is implemented in Java. I decided that I will try to understand every line (of code and comments) from that class and obviously I faced resistance very soon. The following snippet is from HashMap class and talks about Poisson Distribution:

 Ideally, under random hashCodes, the frequency of
 nodes in bins follows a Poisson distribution
 (http://en.wikipedia.org/wiki/Poisson_distribution) with a
 parameter of about 0.5 on average for the default resizing
 threshold of 0.75, although with a large variance because of
 resizing granularity. Ignoring variance, the expected
 occurrences of list size k are (exp(-0.5) * pow(0.5, k) /
 factorial(k)). The first values are:
 0:    0.60653066
 1:    0.30326533
 2:    0.07581633
 3:    0.01263606
 4:    0.00157952
 5:    0.00015795
 6:    0.00001316
 7:    0.00000094
 8:    0.00000006
 more: less than 1 in ten million

I am an average guy in Math and had to understand what Poisson distribution is first. Thanks to the simple video that explained it to me.

Now even after understanding how you calculate probability using Poisson I can't understand what is described above.

Can someone please explain this in simpler language and with an example if possible? It will make my task much more interesting.

like image 405
Asif Avatar asked Dec 08 '13 00:12

Asif


People also ask

How do you show something is a Poisson process?

Poisson Process CriteriaEvents are independent of each other. The occurrence of one event does not affect the probability another event will occur. The average rate (events per time period) is constant. Two events cannot occur at the same time.

Is C++ map a hash table?

In C++, the sorted map (std::map) is usually implemented as a binary tree, and the unsorted map (std::unordered_map) is a hash table with closed addressing.

What is a hash table C++?

A Hash Table in C/C++ (Associative array) is a data structure that maps keys to values. This uses a hash function to compute indexes for a key. Based on the Hash Table index, we can store the value at the appropriate location.

Is a hash table an array?

In a hash table, data is stored in an array format, where each data value has its own unique index value. Access of data becomes very fast if we know the index of the desired data. Thus, it becomes a data structure in which insertion and search operations are very fast irrespective of the size of the data.


1 Answers

A HashMap is organized as an array of "buckets" based on the hashCode of the elements being inserted. Each bucket is (by default) a linked list of elements. Each bucket would have very few elements (ideally, at most one) so that finding a particular element requires very little searching down a linked list.

To take a simple example, let's say we have a HashMap of capacity 4 and a load factor of 0.75 (the default) which means that it can hold up to 3 elements before being resized. An ideal distribution of elements into buckets would look something like this:

bucket | elements
-------+---------
     0 | Z
     1 | X
     2 |
     3 | Y

so any element can be found immediately without any searching within a bucket. On the other hand, a very poor distribution of elements would look like this:

bucket | elements
-------+---------
     0 | 
     1 | Z -> X -> Y
     2 |
     3 |

This will occur if all of the elements happen to hash into the same bucket, so searching for element Y will require traversing down the linked list.

This might not seem like a big deal, but if you have a HashMap with a capacity of 10,000 elements and there are 7,500 elements in a single bucket on a linked list, searching for a particular element will degrade to linear search time -- which is what using a HashMap is trying to avoid.

One issue is that the hashCode for distributing elements into buckets is determined by the objects themselves, and objects' hashCode implementations aren't always very good. If the hashCode isn't very good, then elements can bunch up in certain buckets, and the HashMap will begin to perform poorly.

The comment from the code is talking about the likelihood of different lengths of linked lists appearing in each bucket. First, it assumes the hashCodes are randomly distributed -- which isn't always the case! -- and I think it also assumes that the number of elements in the HashMap is 50% of the number of buckets. Under these assumptions, according to that Poisson distribution, 60.6% of the buckets will be empty, 30.3% will have one element, 7.5% will have two elements, 1.2% will have three elements, and so forth.

In other words, given those (ideal) assumptions, the linked lists within each bucket will usually be very short.

In JDK 8 there is an optimization to turn a linked list into a tree above a certain threshold size, so that at least performance degrades to O(log n) instead of O(n) in the worst case. The question is, what value should be chosen as the threshold? That's what this discussion is all about. The current threshold value TREEIFY_THRESHOLD is 8. Again, under these ideal assumptions, a bucket with a linked list of length 8 will occur only 0.000006% of the time. So if we get a linked list that long, something is clearly not ideal!! It may mean, for instance, that the objects being stored have exceptionally bad hashCodes, so the HashMap has to switch from a linked list to a tree in order to avoid excessive performance degradation.

The link to the source file with the comment in question is here:

http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/jdk8-b119/src/share/classes/java/util/HashMap.java

like image 181
Stuart Marks Avatar answered Oct 13 '22 05:10

Stuart Marks