Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why Stream<T> collect method returns different key order?

I have this code:

public enum Continent {ASIA, EUROPE}

public class Country {      
   private String name;
   private Continent region;

    public Country (String na, Continent reg) { 
        this.name = na;
        this.region = reg;
    }
    public String getName () {return name;} 
    public Continent getRegion () {return region;}
    @Override
    public String toString() {
        return "Country [name=" + name + ", region=" + region + "]";
    }
}

and in main class:

public static void main(String[] args) throws IOException {
        List<Country> couList = Arrays.asList(
            new Country ("Japan", Continent.ASIA), 
            new Country ("Sweden", Continent.EUROPE), 
            new Country ("Norway", Continent.EUROPE));
        Map<Continent, List<String>> regionNames = couList
                .stream()
                //.peek(System.out::println)
                .collect(Collectors.groupingBy(Country::getRegion, Collectors.mapping(Country::getName, Collectors.toList())));
        System.out.println(regionNames);
}

If I run this code, I get this output:

{EUROPE=[Sweden, Norway], ASIA=[Japan]}

but if I uncomment the peek function, I get this output:

Country [name=Japan, region=ASIA]
Country [name=Sweden, region=EUROPE]
Country [name=Norway, region=EUROPE]
{ASIA=[Japan], EUROPE=[Sweden, Norway]}

My question is, can anybody tell me, why the keys order differs in the map regionNames when the peek function is in place?

like image 885
polis Avatar asked May 19 '17 09:05

polis


People also ask

Does stream Map maintain order?

A parallel stream is performed one or more elements at a time. Thus the map() would preserve the encounter of the stream order but not the original List's order.

What stream collect returns?

We can use Stream collect() function to perform a mutable reduction operation and concatenate the list elements. The supplier function is returning a new StringBuilder object in every call. The accumulator function is appending the list string element to the StringBuilder instance.

What is the use of collect in Java 8?

collect() is one of the Java 8's Stream API's terminal methods. It allows us to perform mutable fold operations (repackaging elements to some data structures and applying some additional logic, concatenating them, etc.) on data elements held in a Stream instance.

What is collect stream API?

The collect() method in Stream API collects all objects from a stream object and stored in the type of collection. The user has to provide what type of collection the results can be stored. We specify the collection type using the Collectors Enum.


1 Answers

The enum implementation of hashCode uses the default one supplied by Object. The documentation of that method mentions:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

Since the hash code is what determines the order of the buckets inside a HashMap (which is what groupingBy uses), the order changes when the hash code changes. How this hash code is generate is an implementation detail of the VM (as pointed out by Eugene). By commenting, and un-commenting the line with peek, you have simply found a way to influence (reliably or not) this implementation.


Since this question got a bounty, it seems like people are not satisfied with my answer. I will go a little bit more in depth and look at the open-jdk8 implementation (because it's open source) of hashCode. DISCLAIMER: I will state again that the implementation of the identity hash code algorithm is not specified, and can be completely different for a different VM or between different versions of the same VM. Since OP is observing this behaviour, I'll assume that the VM he is using is Hotspot (the Oracle one, which afaik uses the same hashcode implementation as opendjk). But the main point in doing this is to show that commenting or un-commenting a seemingly unrelated line of code can change the order of buckets in a HashMap. This is also one of the reasons why you should never rely on the iteration order of a collection that does not specify one (like HashMap).

Now, the actual hashing algorithm for openjdk8 is defined in synchronizer.cpp:

 // Marsaglia's xor-shift scheme with thread-specific state
 // This is probably the best overall implementation -- we'll
 // likely make this the default in future releases.
 unsigned t = Self->_hashStateX ;
 t ^= (t << 11) ;
 Self->_hashStateX = Self->_hashStateY ;
 Self->_hashStateY = Self->_hashStateZ ;
 Self->_hashStateZ = Self->_hashStateW ;
 unsigned v = Self->_hashStateW ;
 v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
 Self->_hashStateW = v ;
 value = v ;

As you can see, the hash code is based on these _hashState fields of the Thread object, and the output changes from one call to the next, since the variable values are 'shuffled'.

These variables are initialized in the Thread constructor like so:

_hashStateX = os::random() ;
_hashStateY = 842502087 ;
_hashStateZ = 0x8767 ;    // (int)(3579807591LL & 0xffff) ;
_hashStateW = 273326509 ;

The only moving part here is os::random, which is defined in os.cpp which has a comment describing the algorithm as:

next_rand = (16807*seed) mod (2**31-1)

This seed is the only moving part, and it is defined by _rand_seed, and initialized through a function called init_random, and at the end of the function, the returned value is used as a seed for the next call. A grep through the repo shows this:

PS $> grep -r init_random
os/bsd/vm/os_bsd.cpp:  init_random(1234567);
os/linux/vm/os_linux.cpp:  init_random(1234567);
os/solaris/vm/os_solaris.cpp:  init_random(1234567);
os/windows/vm/os_windows.cpp:  init_random(1234567);
... test methods

It looks like the initial seed is a constant on the platform I'm testing (windows).


From that, I come to the conclusion that a generated identity hash code (in openjdk-8), changes based on how many identity hash codes have been generated on the same thread before it and by how many times os::random was called before the thread generating the hash code was instantiated, which stays the same for the example program. We can already kind of see this because the order of the keys doesn't change from run to run of the program, if the program stays the same. But another way to see it, is to put System.out.println(new Object().hashCode()); at the beginning of the main method, and see that the output is always the same if you run the program a few times.

You will also notice that generating identity hash codes before the stream calls will also change the hash codes of the enum constants, and thus can change the order of the buckets in the map.

Now, let's go back to the Java example. If the identity hash code of an enum constant changes based on how many identity hash codes have been generated before it, the logical conclusion would be that somewhere in the call to peek, an identity hash code is generated, which changes the hash codes that are generated for the enum constant on the line with collect afterwards:

Map<Continent, List<String>> regionNames = couList
        .stream()
        //.peek(System.out::println) // Does this call Object.hashCode?
        .collect(Collectors.groupingBy(Country::getRegion,
            Collectors.mapping(Country::getName, Collectors.toList()))); // hash code for constant generated here

You can see this with an ordinary Java debugger. I've placed a breakpoint on Object#hashCode, and waited to see if the line with peek calls it. (If you try this yourself, I would note that the VM uses HashMap itself, and will call hashCode a number of times before the main method. So be aware of that)

Et voila:

Object.hashCode() line: not available [native method]   
HashMap<K,V>.hash(Object) line: 338 
HashMap<K,V>.put(K, V) line: 611    
HashSet<E>.add(E) line: 219 
Collections$SynchronizedSet<E>(Collections$SynchronizedCollection<E>).add(E) line: 2035 
Launcher$AppClassLoader(ClassLoader).checkPackageAccess(Class<?>, ProtectionDomain) line: 508   
Main.main(String...) line: 19   

The line with peek calls hashCode on a ProtectionDomain object that's used by the class loader that loads the LambdaMetafactory class (which is the Class<?> that you see, I can get the value from my debugger). The hashCode method is actually called a bunch of times (maybe a few hundred?), for the line with peek, throughout the MethodHandle framework.

So, since the line with peek, calls Object#hashCode, before the hash codes for the enum constants are generated (also by calling Object#hashCode), the hash codes of the constants change. So adding or removing the line with peek changes the hash codes of the constants, which changes the order of the buckets in the map.

One final way to confirm, is to generate the hash codes of the constants before the line with peek, by adding:

Continent.ASIA.hashCode();
Continent.EUROPE.hashCode();

To the beginning of the main method.

You will now see that commenting or un-commenting the line with peek has no effect on the order of the buckets.

like image 74
Jorn Vernee Avatar answered Sep 23 '22 08:09

Jorn Vernee