HashMap rehash/resize capacity


HashMap has the following phrase in its documentation:

If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

Notice that the documentation says rehash, not resize - even though a rehash only happens as part of a resize, that is, when the internal table of buckets doubles in size.
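For example, with the JDK defaults (initial capacity 16, load factor 0.75), the resize threshold works out like this (a worked illustration, not HashMap's actual source):

// threshold = capacity * loadFactor; exceeding it triggers a resize
int capacity = 16;
float loadFactor = 0.75f;
int threshold = (int) (capacity * loadFactor); // 12
// so putting the 13th entry doubles the table to 32 and rehashes
// every existing entry into the new, larger table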

And of course HashMap provides a constructor where we can specify this initial capacity.

Constructs an empty HashMap with the specified initial capacity and the default load factor (0.75).

OK, seems easy enough:

// these are NOT chosen randomly...
List<String> list = List.of("DFHXR", "YSXFJ", "TUDDY",
        "AXVUH", "RUTWZ", "DEDUC", "WFCVW", "ZETCU", "GCVUR");

int maxNumberOfEntries = list.size(); // 9
double loadFactor = 0.75;

int capacity = (int) (maxNumberOfEntries / loadFactor + 1); // 13
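In case you're wondering where 16 comes from below: HashMap rounds the requested capacity up to the next power of two. Here is a standalone sketch of that rounding, modelled on HashMap's private tableSizeFor method (the exact JDK implementation differs between versions):

// modelled on java.util.HashMap#tableSizeFor (JDK-version dependent);
// rounds up to the next power of two, capped at 2^30
static int nextPowerOfTwo(int cap) {
    int n = -1 >>> Integer.numberOfLeadingZeros(cap - 1);
    return (n < 0) ? 1 : (n >= (1 << 30)) ? (1 << 30) : n + 1;
}
// nextPowerOfTwo(13) == 16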

So capacity is 13 (internally it is 16, the next power of two); this way we satisfy the documented condition for no rehashes. OK, let's test this, but first introduce a method that will look inside a HashMap at its internal table:

// requires: import java.lang.reflect.Field;
private static <K, V> void debugResize(Map<K, V> map, K key, V value) throws Throwable {

    Field table = map.getClass().getDeclaredField("table");
    table.setAccessible(true);
    Object[] nodes = ((Object[]) table.get(map));

    // first put
    if (nodes == null) {
        // not incrementing currentResizeCalls because of lazy init;
        // the first call to resize is NOT actually a "resize"
        map.put(key, value);
        return;
    }

    int previous = nodes.length;
    map.put(key, value);
    int current = ((Object[]) table.get(map)).length;

    if (previous != current) {
        ++HashMapResize.currentResizeCalls;
        System.out.println(previous + "   " + current);
    }
}

And now let's test this:

static int currentResizeCalls = 0;

public static void main(String[] args) throws Throwable {

    List<String> list = List.of("DFHXR", "YSXFJ", "TUDDY",
            "AXVUH", "RUTWZ", "DEDUC", "WFCVW", "ZETCU", "GCVUR");
    int maxNumberOfEntries = list.size(); // 9
    double loadFactor = 0.75;
    int capacity = (int) (maxNumberOfEntries / loadFactor + 1);

    Map<String, String> map = new HashMap<>(capacity);

    list.forEach(x -> {
        try {
            HashMapResize.debugResize(map, x, x);
        } catch (Throwable throwable) {
            throwable.printStackTrace();
        }
    });

    System.out.println(HashMapResize.currentResizeCalls);
}

Well, resize was called and thus the entries were rehashed - not what the documentation says.


As said, the keys were not chosen randomly. They were set up to trigger the static final int TREEIFY_THRESHOLD = 8 property - the point at which a bucket is converted to a tree. Well, not really, since MIN_TREEIFY_CAPACITY = 64 must also be reached for the tree to appear; until then a resize happens instead, i.e. the table of buckets is doubled in size, and thus the entries are rehashed.
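To make that interplay concrete, here is a toy model (not HashMap's actual source) of the choice made once a single bin has grown past TREEIFY_THRESHOLD:

static final int MIN_TREEIFY_CAPACITY = 64;

// reached (conceptually) only when one bin already holds more than
// TREEIFY_THRESHOLD = 8 entries
static String treeifyBinChoice(int tableLength) {
    return tableLength < MIN_TREEIFY_CAPACITY
            ? "resize(): table doubles, all entries are rehashed"
            : "treeify: the bin's linked list becomes a red-black tree";
}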

I can only hint at why the HashMap documentation is wrong in that sentence: before java-8, a bucket was never converted to a tree, so the property held; from java-8 onwards that is not true anymore. Since I am not sure about this, I'm not adding it as an answer.

asked Oct 07 '18 by Eugene



1 Answer

The line from the documentation,

If the initial capacity is greater than the maximum number of entries divided by the load factor, no rehash operations will ever occur.

indeed dates from before the tree-bin implementation was added in JDK 8 (JEP 180). You can see this text in the JDK 1.6 HashMap documentation. In fact, this text dates all the way back to JDK 1.2 when the Collections Framework (including HashMap) was introduced. You can find unofficial versions of the JDK 1.2 docs around the web, or you can download a version from the archives if you want to see for yourself.

I believe this documentation was correct up until the tree-bin implementation was added. However, as you've observed, there are now cases where it's incorrect. The policy is no longer that resizing can occur only if the number of entries divided by the load factor exceeds the capacity (really, the table length). As you noted, a resize can also occur if the number of entries in a single bucket exceeds TREEIFY_THRESHOLD (currently 8) but the table length is smaller than MIN_TREEIFY_CAPACITY (currently 64).

You can see this decision in the treeifyBin() method of HashMap.

    if (tab == null || (n = tab.length) < MIN_TREEIFY_CAPACITY)
        resize();
    else if ((e = tab[index = (n - 1) & hash]) != null) {

This point in the code is reached when there are more than TREEIFY_THRESHOLD entries in a single bucket. If the table size is at or above MIN_TREEIFY_CAPACITY, this bin is treeified; otherwise, the table is simply resized.

Note that this can leave bins with rather more entries than TREEIFY_THRESHOLD at small table sizes. This isn't terribly difficult to demonstrate. First, some reflective HashMap-dumping code:

// run with --add-opens java.base/java.util=ALL-UNNAMED

static Class<?> classNode;
static Class<?> classTreeNode;
static Field fieldNodeNext;
static Field fieldHashMapTable;

static void init() throws ReflectiveOperationException {
    classNode = Class.forName("java.util.HashMap$Node");
    classTreeNode = Class.forName("java.util.HashMap$TreeNode");
    fieldNodeNext = classNode.getDeclaredField("next");
    fieldNodeNext.setAccessible(true);
    fieldHashMapTable = HashMap.class.getDeclaredField("table");
    fieldHashMapTable.setAccessible(true);
}

static void dumpMap(HashMap<?, ?> map) throws ReflectiveOperationException {
    Object[] table = (Object[]) fieldHashMapTable.get(map);
    System.out.printf("map size = %d, table length = %d%n", map.size(), table.length);
    for (int i = 0; i < table.length; i++) {
        Object node = table[i];
        if (node == null)
            continue;
        System.out.printf("table[%d] = %s", i,
            classTreeNode.isInstance(node) ? "TreeNode" : "BasicNode");

        for (; node != null; node = fieldNodeNext.get(node))
            System.out.print(" " + node);
        System.out.println();
    }
}

Now, let's add a bunch of strings that all fall into the same bucket. These strings are chosen such that their hash values, as computed by HashMap, are all 0 mod 64.

public static void main(String[] args) throws ReflectiveOperationException {
    init();
    List<String> list = List.of(
        "LBCDD", "IKBNU", "WZQAG", "MKEAZ", "BBCHF", "KRQHE", "ZZMWH", "FHLVH",
        "ZFLXM", "TXXPE", "NSJDQ", "BXDMJ", "OFBCR", "WVSIG", "HQDXY");

    HashMap<String, String> map = new HashMap<>(1, 10.0f);

    for (String s : list) {
        System.out.println("===> put " + s);
        map.put(s, s);
        dumpMap(map);
    }
}
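As a quick sanity check of the "0 mod 64" claim, you can add something like the following to the main method above. It assumes the spreading step HashMap applies to each key's hashCode(), namely h ^ (h >>> 16), and that the bucket index is hash & (tableLength - 1):

for (String s : list) {
    int h = s.hashCode();
    int spread = h ^ (h >>> 16); // the same spreading HashMap.hash() performs
    // bucket index for a table of length 64; prints 0 for every key above
    System.out.println(s + " -> bucket " + (spread & 63));
}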

Starting from an initial table size of 1 and a ridiculous load factor, this puts 8 entries into the lone bucket. Then, each time another entry is added, the table is resized (doubled) but all the entries end up in the same bucket. This eventually results in a table of size 64 with one bucket having a linear chain of nodes ("basic nodes") of length 14, before adding the next entry finally converts this to a tree.

Output of the program is as follows:

===> put LBCDD
map size = 1, table length = 1
table[0] = BasicNode LBCDD=LBCDD
===> put IKBNU
map size = 2, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU
===> put WZQAG
map size = 3, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG
===> put MKEAZ
map size = 4, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ
===> put BBCHF
map size = 5, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF
===> put KRQHE
map size = 6, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE
===> put ZZMWH
map size = 7, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH
===> put FHLVH
map size = 8, table length = 1
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH
===> put ZFLXM
map size = 9, table length = 2
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM
===> put TXXPE
map size = 10, table length = 4
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE
===> put NSJDQ
map size = 11, table length = 8
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE NSJDQ=NSJDQ
===> put BXDMJ
map size = 12, table length = 16
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE NSJDQ=NSJDQ BXDMJ=BXDMJ
===> put OFBCR
map size = 13, table length = 32
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE NSJDQ=NSJDQ BXDMJ=BXDMJ OFBCR=OFBCR
===> put WVSIG
map size = 14, table length = 64
table[0] = BasicNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE NSJDQ=NSJDQ BXDMJ=BXDMJ OFBCR=OFBCR WVSIG=WVSIG
===> put HQDXY
map size = 15, table length = 64
table[0] = TreeNode LBCDD=LBCDD IKBNU=IKBNU WZQAG=WZQAG MKEAZ=MKEAZ BBCHF=BBCHF KRQHE=KRQHE ZZMWH=ZZMWH FHLVH=FHLVH ZFLXM=ZFLXM TXXPE=TXXPE NSJDQ=NSJDQ BXDMJ=BXDMJ OFBCR=OFBCR WVSIG=WVSIG HQDXY=HQDXY
answered Oct 21 '22 by Stuart Marks