Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Iterating through the union of several Java Map key sets efficiently

Tags:

java

key

union

set

map

In one of my Java 6 projects I have an array of LinkedHashMap instances as input to a method which has to iterate through all keys (i.e. through the union of the key sets of all maps) and work with the associated values. Not all keys exist in all maps and the method should not go through each key more than once or alter the input maps.

My current implementation looks like this:

Set<Object> keyset = new HashSet<Object>();

for (Map<Object, Object> map : input) {
    for (Object key : map.keySet()) {
        if (keyset.add(key)) {
            ...
        }
    }
}

The HashSet instance ensures that no key will be acted upon more than once.

Unfortunately this part of the code is rather critical performance-wise, as it is called very frequently. In fact, according to the profiler over 10% of the CPU time is spent in the HashSet.add() method.

I am trying to optimise this code us much as possible. The use of LinkedHashMap with its more efficient iterators (in comparison to the plain HashMap) was a significant boost, but I was hoping to reduce what is essentially book-keeping time to the minimum.

Putting all the keys in the HashSet before-hand, by using addAll() proved to be less efficient, due to the cost of calling HashSet.contains() afterwards. At the moment I am looking at whether I can use a bitmap (well, a boolean[] to be exact) to avoid the HashSet completely, but it may not be possible at all, depending on my key range.

Is there a more efficient way to do this? Preferrably something that will not pose restrictions on the keys?

EDIT:

A few clarifications and comments:

  • I do need all the values from the maps - I cannot drop any of them.

  • I also need to know which map each value came from. The missing part (...) in my code would be something like this:

    for (Map<Object, Object> m : input) {
        Object v = m.get(key);
    
        // Do something with v
    }
    

    A simple example to get an idea of what I need to do with the maps would be to print all maps in parallel like this:

    Key Map0 Map1 Map2
    F   1    null 2
    B   2    3    null
    C   null null 5
    ...
    

    That's not what I am actually doing, but you should get the idea.

  • The input maps are extremely variable. In fact, each call of this method uses a different set of them. Therefore I would not gain anything by caching the union of their keys.

  • My keys are all String instances. They are sort-of-interned on the heap using a separate HashMap, since they are pretty repetitive, therefore their hash code is already cached and most hash validations (when the HashMap implementation is checking whether two keys are actually equal, after their hash codes match) boil down to an identity comparison (==). The profiler confirms that only 0.5% of the CPU time is spent on String.equals() and String.hashCode().

EDIT 2:

Based on the suggestions in the answers, I made a few tests, profiling and benchmarking along the way. I ended up with roughly a 7% increase in performance. What I did:

  • I set the initial capacity of the HashSet to double the collective size of all input maps. This gained me something in the region of 1-2%, by eliminating most (all?) resize() calls in the HashSet.

  • I used Map.entrySet() for the map I am currently iterating. I had originally avoided this approach due to the additional code and the fear that the extra checks and Map.Entry getter method calls would outweigh any advantages. It turned out that the overall code was slightly faster.

  • I am sure that some people will start screaming at me, but here it is: Raw types. More specifically I used the raw form of HashSet in the code above. Since I was already using Object as its content type, I do not lose any type safety. The cost of that useless checkcast operation when calling HashSet.add() was apparently important enough to produce a 4% increase in performance when removed. Why the JVM insists on checking casts to Object is beyond me...

like image 316
thkala Avatar asked Jun 29 '11 08:06

thkala


2 Answers

Can't provide a replacement for your approach but a few suggestions to (slightly) optimize the existing code.

  1. Consider initializing the hash set with a capacity (the sum of the sizes of all maps). This avoids/reduces resizing of the set during an add operation
  2. Consider not using the keySet() as it will always create a new set in the background. Use the entrySet(), that should be much faster
  3. Have a look at the implementations of equals() and hashCode() - if they are "expensive", then you have a negative impact on the add method.
like image 187
Andreas Dolk Avatar answered Sep 27 '22 19:09

Andreas Dolk


How you avoid using a HashSet depends on what you are doing.

I would only calculate the union once each time the input is changed. This should be relatively rare conmpared with the number of lookups.

// on an update.
Map<Key, Value> union = new LinkedHashMap<Key, Value>();
for (Map<Key, Value> map : input) 
    union.putAll(map);


// on a lookup.
Value value = union.get(key);
// process each key once
for(Entry<Key, Value> entry: union) {
   // do something.
}
like image 43
Peter Lawrey Avatar answered Sep 27 '22 17:09

Peter Lawrey