I have this code: <pre class="prettyprint"><code>public enum Continent {ASIA, EUROPE} public class Country { private String name; private Continent region; public Country (String na, Continent reg) { this.name = na; this.region = reg; } public String getName () {return name;} public Continent getRegion () {return region;} @Override public String toString() { return "Country [name=" + name + ", region=" + region + "]"; } } </code></pre> and in main class: <pre class="prettyprint"><code>public static void main(String[] args) throws IOException { List<Country> couList = Arrays.asList( new Country ("Japan", Continent.ASIA), new Country ("Sweden", Continent.EUROPE), new Country ("Norway", Continent.EUROPE)); Map<Continent, List<String>> regionNames = couList .stream() //.peek(System.out::println) .collect(Collectors.groupingBy(Country::getRegion, Collectors.mapping(Country::getName, Collectors.toList()))); System.out.println(regionNames); } </code></pre> If I run this code, I get this output: <pre class="prettyprint"><code>{EUROPE=[Sweden, Norway], ASIA=[Japan]} </code></pre> but if I uncomment the <code>peek</code> function, I get this output: <pre class="prettyprint"><code>Country [name=Japan, region=ASIA] Country [name=Sweden, region=EUROPE] Country [name=Norway, region=EUROPE] {ASIA=[Japan], EUROPE=[Sweden, Norway]} </code></pre> My question is, can anybody tell me, why the keys order differs in the map <code>regionNames</code> when the <code>peek</code> function is in place?

The <code>enum</code> implementation of <code>hashCode</code> uses the default one supplied by <code>Object</code>. The documentation of that method mentions: <blockquote> Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application. </blockquote> Since the hash code is what determines the order of the buckets inside a <code>HashMap</code> (which is what <code>groupingBy</code> uses), the order changes when the hash code changes. How this hash code is generate is an implementation detail of the VM (as pointed out by Eugene). By commenting, and un-commenting the line with <code>peek</code>, you have simply found a way to influence (reliably or not) this implementation. <hr> Since this question got a bounty, it seems like people are not satisfied with my answer. I will go a little bit more in depth and look at the open-jdk8 implementation (because it's open source) of <code>hashCode</code>. DISCLAIMER: I will state again that the implementation of the identity hash code algorithm is not specified, and can be completely different for a different VM or between different versions of the same VM. Since OP is observing this behaviour, I'll assume that the VM he is using is Hotspot (the Oracle one, which afaik uses the same hashcode implementation as opendjk). But the main point in doing this is to show that commenting or un-commenting a seemingly unrelated line of code can change the order of buckets in a <code>HashMap</code>. This is also one of the reasons why you should never rely on the iteration order of a collection that does not specify one (like <code>HashMap</code>). Now, the actual hashing algorithm for openjdk8 is defined in <code>synchronizer.cpp</code>: <pre class="prettyprint"><code> // Marsaglia's xor-shift scheme with thread-specific state // This is probably the best overall implementation -- we'll // likely make this the default in future releases. unsigned t = Self->_hashStateX ; t ^= (t << 11) ; Self->_hashStateX = Self->_hashStateY ; Self->_hashStateY = Self->_hashStateZ ; Self->_hashStateZ = Self->_hashStateW ; unsigned v = Self->_hashStateW ; v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ; Self->_hashStateW = v ; value = v ; </code></pre> As you can see, the hash code is based on these <code>_hashState</code> fields of the <code>Thread</code> object, and the output changes from one call to the next, since the variable values are 'shuffled'. These variables are initialized in the <code>Thread</code> constructor like so: <pre class="prettyprint"><code>_hashStateX = os::random() ; _hashStateY = 842502087 ; _hashStateZ = 0x8767 ; // (int)(3579807591LL & 0xffff) ; _hashStateW = 273326509 ; </code></pre> The only moving part here is <code>os::random</code>, which is defined in <code>os.cpp</code> which has a comment describing the algorithm as: <pre class="prettyprint"><code>next_rand = (16807*seed) mod (2**31-1) </code></pre> This <code>seed</code> is the only moving part, and it is defined by <code>_rand_seed</code>, and initialized through a function called <code>init_random</code>, and at the end of the function, the returned value is used as a seed for the next call. A <code>grep</code> through the repo shows this: <pre class="prettyprint"><code>PS $> grep -r init_random os/bsd/vm/os_bsd.cpp: init_random(1234567); os/linux/vm/os_linux.cpp: init_random(1234567); os/solaris/vm/os_solaris.cpp: init_random(1234567); os/windows/vm/os_windows.cpp: init_random(1234567); ... test methods </code></pre> It looks like the initial seed is a constant on the platform I'm testing (windows). <hr> From that, I come to the conclusion that a generated identity hash code (in openjdk-8), changes based on how many identity hash codes have been generated on the same thread before it and by how many times <code>os::random</code> was called before the thread generating the hash code was instantiated, which stays the same for the example program. We can already kind of see this because the order of the keys doesn't change from run to run of the program, if the program stays the same. But another way to see it, is to put <code>System.out.println(new Object().hashCode());</code> at the beginning of the <code>main</code> method, and see that the output is always the same if you run the program a few times. You will also notice that generating identity hash codes before the stream calls will also change the hash codes of the enum constants, and thus can change the order of the buckets in the map. Now, let's go back to the Java example. If the identity hash code of an enum constant changes based on how many identity hash codes have been generated before it, the logical conclusion would be that somewhere in the call to <code>peek</code>, an identity hash code is generated, which changes the hash codes that are generated for the enum constant on the line with <code>collect</code> afterwards: <pre class="prettyprint"><code>Map<Continent, List<String>> regionNames = couList .stream() //.peek(System.out::println) // Does this call Object.hashCode? .collect(Collectors.groupingBy(Country::getRegion, Collectors.mapping(Country::getName, Collectors.toList()))); // hash code for constant generated here </code></pre> You can see this with an ordinary Java debugger. I've placed a breakpoint on <code>Object#hashCode</code>, and waited to see if the line with <code>peek</code> calls it. (If you try this yourself, I would note that the VM uses <code>HashMap</code> itself, and will call <code>hashCode</code> a number of times before the <code>main</code> method. So be aware of that) Et voila: <pre class="prettyprint"><code>Object.hashCode() line: not available [native method] HashMap<K,V>.hash(Object) line: 338 HashMap<K,V>.put(K, V) line: 611 HashSet<E>.add(E) line: 219 Collections$SynchronizedSet<E>(Collections$SynchronizedCollection<E>).add(E) line: 2035 Launcher$AppClassLoader(ClassLoader).checkPackageAccess(Class<?>, ProtectionDomain) line: 508 Main.main(String...) line: 19 </code></pre> The line with <code>peek</code> calls <code>hashCode</code> on a <code>ProtectionDomain</code> object that's used by the class loader that loads the <code>LambdaMetafactory</code> class (which is the <code>Class<?></code> that you see, I can get the value from my debugger). The <code>hashCode</code> method is actually called a bunch of times (maybe a few hundred?), for the line with <code>peek</code>, throughout the MethodHandle framework. So, since the line with <code>peek</code>, calls <code>Object#hashCode</code>, before the hash codes for the enum constants are generated (also by calling <code>Object#hashCode</code>), the hash codes of the constants change. So adding or removing the line with <code>peek</code> changes the hash codes of the constants, which changes the order of the buckets in the map. One final way to confirm, is to generate the hash codes of the constants before the line with <code>peek</code>, by adding: <pre class="prettyprint"><code>Continent.ASIA.hashCode(); Continent.EUROPE.hashCode(); </code></pre> To the beginning of the <code>main</code> method. You will now see that commenting or un-commenting the line with <code>peek</code> has no effect on the order of the buckets.

Why Stream<T> collect method returns different key order?

Tags:

java

lambda

java-8

java-stream

I have this code:

public enum Continent {ASIA, EUROPE}

public class Country {      
   private String name;
   private Continent region;

    public Country (String na, Continent reg) { 
        this.name = na;
        this.region = reg;
    }
    public String getName () {return name;} 
    public Continent getRegion () {return region;}
    @Override
    public String toString() {
        return "Country [name=" + name + ", region=" + region + "]";
    }
}

and in main class:

public static void main(String[] args) throws IOException {
        List<Country> couList = Arrays.asList(
            new Country ("Japan", Continent.ASIA), 
            new Country ("Sweden", Continent.EUROPE), 
            new Country ("Norway", Continent.EUROPE));
        Map<Continent, List<String>> regionNames = couList
                .stream()
                //.peek(System.out::println)
                .collect(Collectors.groupingBy(Country::getRegion, Collectors.mapping(Country::getName, Collectors.toList())));
        System.out.println(regionNames);
}

If I run this code, I get this output:

{EUROPE=[Sweden, Norway], ASIA=[Japan]}

but if I uncomment the peek function, I get this output:

Country [name=Japan, region=ASIA]
Country [name=Sweden, region=EUROPE]
Country [name=Norway, region=EUROPE]
{ASIA=[Japan], EUROPE=[Sweden, Norway]}

My question is, can anybody tell me, why the keys order differs in the map regionNames when the peek function is in place?

885

asked May 19 '17 09:05

polis

1 Answers

The enum implementation of hashCode uses the default one supplied by Object. The documentation of that method mentions:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

Since the hash code is what determines the order of the buckets inside a HashMap (which is what groupingBy uses), the order changes when the hash code changes. How this hash code is generate is an implementation detail of the VM (as pointed out by Eugene). By commenting, and un-commenting the line with peek, you have simply found a way to influence (reliably or not) this implementation.

Since this question got a bounty, it seems like people are not satisfied with my answer. I will go a little bit more in depth and look at the open-jdk8 implementation (because it's open source) of hashCode. DISCLAIMER: I will state again that the implementation of the identity hash code algorithm is not specified, and can be completely different for a different VM or between different versions of the same VM. Since OP is observing this behaviour, I'll assume that the VM he is using is Hotspot (the Oracle one, which afaik uses the same hashcode implementation as opendjk). But the main point in doing this is to show that commenting or un-commenting a seemingly unrelated line of code can change the order of buckets in a HashMap. This is also one of the reasons why you should never rely on the iteration order of a collection that does not specify one (like HashMap).

Now, the actual hashing algorithm for openjdk8 is defined in synchronizer.cpp:

 // Marsaglia's xor-shift scheme with thread-specific state
 // This is probably the best overall implementation -- we'll
 // likely make this the default in future releases.
 unsigned t = Self->_hashStateX ;
 t ^= (t << 11) ;
 Self->_hashStateX = Self->_hashStateY ;
 Self->_hashStateY = Self->_hashStateZ ;
 Self->_hashStateZ = Self->_hashStateW ;
 unsigned v = Self->_hashStateW ;
 v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
 Self->_hashStateW = v ;
 value = v ;

As you can see, the hash code is based on these _hashState fields of the Thread object, and the output changes from one call to the next, since the variable values are 'shuffled'.

These variables are initialized in the Thread constructor like so:

_hashStateX = os::random() ;
_hashStateY = 842502087 ;
_hashStateZ = 0x8767 ;    // (int)(3579807591LL & 0xffff) ;
_hashStateW = 273326509 ;

The only moving part here is os::random, which is defined in os.cpp which has a comment describing the algorithm as:

next_rand = (16807*seed) mod (2**31-1)

This seed is the only moving part, and it is defined by _rand_seed, and initialized through a function called init_random, and at the end of the function, the returned value is used as a seed for the next call. A grep through the repo shows this:

PS $> grep -r init_random
os/bsd/vm/os_bsd.cpp:  init_random(1234567);
os/linux/vm/os_linux.cpp:  init_random(1234567);
os/solaris/vm/os_solaris.cpp:  init_random(1234567);
os/windows/vm/os_windows.cpp:  init_random(1234567);
... test methods

It looks like the initial seed is a constant on the platform I'm testing (windows).

From that, I come to the conclusion that a generated identity hash code (in openjdk-8), changes based on how many identity hash codes have been generated on the same thread before it and by how many times os::random was called before the thread generating the hash code was instantiated, which stays the same for the example program. We can already kind of see this because the order of the keys doesn't change from run to run of the program, if the program stays the same. But another way to see it, is to put System.out.println(new Object().hashCode()); at the beginning of the main method, and see that the output is always the same if you run the program a few times.

You will also notice that generating identity hash codes before the stream calls will also change the hash codes of the enum constants, and thus can change the order of the buckets in the map.

Now, let's go back to the Java example. If the identity hash code of an enum constant changes based on how many identity hash codes have been generated before it, the logical conclusion would be that somewhere in the call to peek, an identity hash code is generated, which changes the hash codes that are generated for the enum constant on the line with collect afterwards:

Map<Continent, List<String>> regionNames = couList
        .stream()
        //.peek(System.out::println) // Does this call Object.hashCode?
        .collect(Collectors.groupingBy(Country::getRegion,
            Collectors.mapping(Country::getName, Collectors.toList()))); // hash code for constant generated here

You can see this with an ordinary Java debugger. I've placed a breakpoint on Object#hashCode, and waited to see if the line with peek calls it. (If you try this yourself, I would note that the VM uses HashMap itself, and will call hashCode a number of times before the main method. So be aware of that)

Et voila:

Object.hashCode() line: not available [native method]   
HashMap<K,V>.hash(Object) line: 338 
HashMap<K,V>.put(K, V) line: 611    
HashSet<E>.add(E) line: 219 
Collections$SynchronizedSet<E>(Collections$SynchronizedCollection<E>).add(E) line: 2035 
Launcher$AppClassLoader(ClassLoader).checkPackageAccess(Class<?>, ProtectionDomain) line: 508   
Main.main(String...) line: 19

The line with peek calls hashCode on a ProtectionDomain object that's used by the class loader that loads the LambdaMetafactory class (which is the Class<?> that you see, I can get the value from my debugger). The hashCode method is actually called a bunch of times (maybe a few hundred?), for the line with peek, throughout the MethodHandle framework.

So, since the line with peek, calls Object#hashCode, before the hash codes for the enum constants are generated (also by calling Object#hashCode), the hash codes of the constants change. So adding or removing the line with peek changes the hash codes of the constants, which changes the order of the buckets in the map.

One final way to confirm, is to generate the hash codes of the constants before the line with peek, by adding:

Continent.ASIA.hashCode();
Continent.EUROPE.hashCode();

To the beginning of the main method.

You will now see that commenting or un-commenting the line with peek has no effect on the order of the buckets.

answered Sep 23 '22 08:09

Jorn Vernee

Related questions
                            
                                Regex to match if string *only* contains *all* characters from a character set, plus an optional one
                            
                                Best approach for linking diverse entity types in JPA
                            
                                Dagger 2 - Constructor injection - Scope
                            
                                Combine Realm.io and Firebase
                            
                                Spring Cloud Config Server Priority of Environment Variables
                            
                                Java SSL Server disable weak elliptic curves
                            
                                Selenium: unknown error: session deleted because of page crash
                            
                                Apache Commons CLI: how to prevent using short-name for options?
                            
                                How to handle post data of size more than 2 mb
                            
                                Log4j logging twice with different formats
                            
                                Optional injection Dagger 2
                            
                                Java 9 - REST with Spring 5 & Jigsaw - Is it possible?
                            
                                How to read file from src/main/resources with annotation processor?
                            
                                How to document args in Java main
                            
                                How to retrieve the key with a maximum value in a TreeMap in Java?
                            
                                Spring-Boot one @Scheduled task using multiple cron expressions from yaml file
                            
                                Jaxb UnMarshal Error : unexpected element (uri:"", local:"processedSalesOrderTypeList"). Expected elements are
                            
                                What is http-remoting Protocol
                            
                                Apache Kafka order windowed messages based on their value
                            
                                Enforce constraints on @Value annotated field in Spring Boot application

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With