As per the following linked document: Java HashMap Implementation
I'm confused by the implementation of HashMap (or rather, by an enhancement in HashMap). My queries are:
Firstly
static final int TREEIFY_THRESHOLD = 8;
static final int UNTREEIFY_THRESHOLD = 6;
static final int MIN_TREEIFY_CAPACITY = 64;
Why and how are these constants used? I would like some clear examples of this. How do they achieve a performance gain?
Secondly
If you look at the source code of HashMap in the JDK, you will find the following static inner class:
static final class TreeNode<K, V> extends java.util.LinkedHashMap.Entry<K, V> {
    HashMap.TreeNode<K, V> parent;  // red-black tree links
    HashMap.TreeNode<K, V> left;
    HashMap.TreeNode<K, V> right;
    HashMap.TreeNode<K, V> prev;    // needed to unlink next upon deletion
    boolean red;

    TreeNode(int hash, K key, V val, HashMap.Node<K, V> next) {
        super(hash, key, val, next);
    }

    // Returns the root of the tree containing this node.
    final HashMap.TreeNode<K, V> root() {
        HashMap.TreeNode<K, V> r = this;
        while (true) {
            HashMap.TreeNode<K, V> p = r.parent;
            if (p == null) {
                return r;
            }
            r = p;
        }
    }
    // ...
}
How is it used? I just want an explanation of the algorithm.
A bucket will change from using a linked list to a balanced tree. These changes ensure a performance of O(log(n)) in worst-case scenarios and O(1) with a proper hashCode(). The alternative String hash function added in Java 7 has been removed.
HashMap uses an array of Nodes (named table), where each Node holds the key, the value, the hash, and a link to the next Node. In the OpenJDK source this Node is the static inner class HashMap.Node (it was called HashMap.Entry in older versions). Basically, HashMap has an array where the key-value data is stored. It calculates the index in the array where a Node should be placed, and the Node is placed there.
HashMap uses its static inner class Node<K,V> for storing map entries. That means each entry in the HashMap is a Node. Internally, HashMap uses the hashCode of the key object, and this hashCode is further used by the hash function to find the index of the bucket where the new entry can be added.
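For reference, here is a sketch of what that Node looks like, abbreviated from the OpenJDK 8 HashMap source (the accessor methods are omitted):

static class Node<K, V> implements Map.Entry<K, V> {
    final int hash;     // cached hash of the key, computed once at insertion
    final K key;
    V value;
    Node<K, V> next;    // next Node in the same bucket (the linked list)

    Node(int hash, K key, V value, Node<K, V> next) {
        this.hash = hash;
        this.key = key;
        this.value = value;
        this.next = next;
    }
    // getKey(), getValue(), setValue(), equals(), hashCode(), toString() omitted
}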
HashMap contains a certain number of buckets. It uses hashCode to determine which bucket to put an entry into. For simplicity's sake, imagine it as a modulus.
If our hashcode is 123456 and we have 4 buckets, 123456 % 4 = 0
so the item goes in the first bucket, Bucket 1.
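As a minimal sketch of that simplified picture (the real implementation uses a bit mask rather than %, as explained further down):

int hashCode = 123456;
int numberOfBuckets = 4;
int bucketIndex = hashCode % numberOfBuckets; // 123456 % 4 == 0 -> the first bucket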
If our hashCode
function is good, it should provide an even distribution so that all the buckets will be used somewhat equally. In this case, the bucket uses a linked list to store the values.
But you can't rely on people to implement good hash functions. People will often write poor hash functions which will result in a non-even distribution. It's also possible that we could just get unlucky with our inputs.
The less even this distribution is, the further we're moving from O(1) operations and the closer we're moving towards O(n) operations.
The implementation of HashMap tries to mitigate this by organising some buckets into trees rather than linked lists if the buckets become too large. This is what TREEIFY_THRESHOLD = 8
is for. If a bucket contains more than eight items, it should become a tree.
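As a hedged illustration (TreeifyDemo and BadKey are made-up names, and treeification is internal state, so nothing in the output proves it happened, but this is the situation in which the JDK converts a bucket into a tree):

import java.util.HashMap;
import java.util.Map;

public class TreeifyDemo {

    // Hypothetical key with a deliberately bad hashCode: every instance collides.
    static final class BadKey implements Comparable<BadKey> {
        final int id;
        BadKey(int id) { this.id = id; }
        @Override public int hashCode() { return 42; }   // all keys share one bucket
        @Override public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }
        @Override public int compareTo(BadKey other) {   // lets the tree order colliding keys
            return Integer.compare(id, other.id);
        }
    }

    public static void main(String[] args) {
        // Pre-sized past MIN_TREEIFY_CAPACITY so the crowded bucket can actually treeify.
        Map<BadKey, String> map = new HashMap<>(64);
        for (int i = 0; i < 20; i++) {
            map.put(new BadKey(i), "value-" + i);
        }
        // All 20 entries sit in one bucket; once it grew past TREEIFY_THRESHOLD (8),
        // that bucket became a red-black tree, so this lookup is O(log n), not O(n).
        System.out.println(map.get(new BadKey(7))); // value-7
    }
}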
This tree is a Red-Black tree, presumably chosen because it offers some worst-case guarantees. It is first sorted by hash code. If the hash codes are the same, it uses the compareTo
method of Comparable
if the objects implement that interface, else the identity hash code.
If entries are removed from the map, the number of entries in the bucket might drop so that this tree structure is no longer necessary. That's what UNTREEIFY_THRESHOLD = 6 is for. If the number of elements in a bucket drops to six or fewer, we might as well go back to using a linked list.
Finally, there is MIN_TREEIFY_CAPACITY = 64.
When a hash map grows in size, it automatically resizes itself to have more buckets. If we have a small HashMap, the likelihood of us getting very full buckets is quite high, because we don't have that many different buckets to put stuff into. It's much better to have a bigger HashMap, with more buckets that are less full. This constant basically says not to start making buckets into trees if our HashMap is very small - it should resize to be larger first instead.
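A hedged sketch of that rule, reusing the idea of a key whose hashCode always collides (MinTreeifyCapacityDemo and CollidingKey are made-up names; the resize-versus-treeify decision happens inside the JDK and is not observable through the public API, so the comments describe what the JDK does rather than what the program prints):

import java.util.HashMap;
import java.util.Map;

public class MinTreeifyCapacityDemo {

    // Made-up key whose hashCode always collides, for illustration only.
    static final class CollidingKey {
        @Override public int hashCode() { return 1; }
    }

    public static void main(String[] args) {
        Map<CollidingKey, Integer> map = new HashMap<>(); // table starts at 16 buckets
        for (int i = 0; i < 9; i++) {
            map.put(new CollidingKey(), i);
        }
        // The 9th colliding entry pushes the bucket past TREEIFY_THRESHOLD, but the
        // table still has fewer than MIN_TREEIFY_CAPACITY (64) buckets, so the JDK
        // resizes the table instead of treeifying the bucket. Only once the table
        // has reached 64 buckets and a bucket is still too full does it become a tree.
        System.out.println(map.size()); // 9
    }
}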
To answer your question about the performance gain, these optimisations were added to improve the worst case. You would probably only see a noticeable performance improvement because of these optimisations if your hashCode
function was not very good.
It is designed to protect against bad hashCode
implementations and also provides basic protection against collision attacks, where a bad actor may attempt to slow down a system by deliberately selecting inputs which occupy the same buckets.
To put it more simply (as simply as I could), plus some more details. These constants depend on a lot of internal things that would be very cool to understand before moving to them directly.
TREEIFY_THRESHOLD -> when a single bucket reaches this (and the total number of buckets is at least MIN_TREEIFY_CAPACITY), it is transformed into a red-black tree. Why? Because of search speed. Think about it in a different way: it would take at most 32 steps to search for an Entry within a bucket/bin with Integer.MAX_VALUE entries.
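That number comes straight from the height of a balanced binary tree; a quick check (nothing HashMap-specific about this snippet):

// A balanced binary search tree halves the search space at every step, so the
// number of steps is about log2 of the entry count.
System.out.println(64 - Long.numberOfLeadingZeros(Integer.MAX_VALUE)); // 31, i.e. ~32 steps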
Some intro for the next topic. Why is the number of bins/buckets always a power of two? At least two reasons: a bit mask is faster than the modulo operation, and in Java modulo on negative numbers is negative. And you can't put an Entry into a "negative" bucket:
int arrayIndex = hashCode % buckets; // will be negative
buckets[arrayIndex] = Entry;         // obviously will fail
Instead, there is a nice trick used in place of modulo:
(n - 1) & hash // n is the number of bins, hash is the hash of the key
That is semantically the same as the modulo operation when n is a power of two: it keeps only the lower bits. This has an interesting consequence when you do:
Map<String, String> map = new HashMap<>();
In the case above (the default capacity of 16 buckets), the decision of where an entry goes is based only on the last 4 bits of your hashcode.
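A minimal sketch contrasting the mask with plain %, using a made-up negative hash value:

int n = 16;           // default number of buckets; always a power of two
int hash = -123457;   // hashCode() can legitimately be negative

System.out.println(hash % n);               // -1: cannot be used as an array index
System.out.println((n - 1) & hash);         // 15: keeps only the lowest 4 bits, never negative
System.out.println(Math.floorMod(hash, n)); // 15: the mask matches floor-modulo for power-of-two n
// (the real HashMap also XORs the upper 16 bits of the hashCode into the lower ones
// first, precisely because only the low bits end up mattering)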
This is where doubling the buckets comes into play. Under certain conditions (it would take a long time to explain them in exact detail), the number of buckets is doubled. Why does that matter? When the number of buckets is doubled, one more bit comes into play. So with 16 buckets, the last 4 bits of the hashcode decide where an entry goes. Double that to 32 buckets, and the last 5 bits decide where an entry goes.
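A sketch of what that extra bit means for a single (made-up) hash value when the table doubles from 16 to 32 buckets:

int hash = 0b10110;  // 22, a made-up hash value

System.out.println((16 - 1) & hash); // 6  -> bucket 6 while there are 16 buckets
System.out.println((32 - 1) & hash); // 22 -> after doubling, the new 5th bit is 1,
                                     //       so the entry moves to bucket 6 + 16 = 22

During a resize an entry either stays at its old index (if the new bit is 0) or moves to old index + old capacity (if the new bit is 1), which is exactly how each bin is split.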
This process is called re-hashing, and it might get slow. That is why (for people who care) HashMap is jokingly described as: fast, fast, fast, slooow. There are other implementations - search for pauseless hashmap...
Now UNTREEIFY_THRESHOLD comes into play after re-hashing. At that point, some entries might move from one bin to another (one more bit joins the (n - 1) & hash computation, so entries can land in different buckets), and the count in a treeified bin might drop to UNTREEIFY_THRESHOLD. At that point it no longer pays off to keep the bin as a red-black tree, so it is turned back into a linked list of nodes, like entry.next.next....
MIN_TREEIFY_CAPACITY is the minimum number of buckets required before any single bucket is transformed into a tree.