I have this code:
public enum Continent {ASIA, EUROPE}
public class Country {
private String name;
private Continent region;
public Country (String na, Continent reg) {
this.name = na;
this.region = reg;
}
public String getName () {return name;}
public Continent getRegion () {return region;}
@Override
public String toString() {
return "Country [name=" + name + ", region=" + region + "]";
}
}
and in main class:
public static void main(String[] args) throws IOException {
List<Country> couList = Arrays.asList(
new Country ("Japan", Continent.ASIA),
new Country ("Sweden", Continent.EUROPE),
new Country ("Norway", Continent.EUROPE));
Map<Continent, List<String>> regionNames = couList
.stream()
//.peek(System.out::println)
.collect(Collectors.groupingBy(Country::getRegion, Collectors.mapping(Country::getName, Collectors.toList())));
System.out.println(regionNames);
}
If I run this code, I get this output:
{EUROPE=[Sweden, Norway], ASIA=[Japan]}
but if I uncomment the peek
function, I get this output:
Country [name=Japan, region=ASIA]
Country [name=Sweden, region=EUROPE]
Country [name=Norway, region=EUROPE]
{ASIA=[Japan], EUROPE=[Sweden, Norway]}
My question is, can anybody tell me, why the keys order differs in the map regionNames
when the peek
function is in place?
A parallel stream is performed one or more elements at a time. Thus the map() would preserve the encounter of the stream order but not the original List's order.
We can use Stream collect() function to perform a mutable reduction operation and concatenate the list elements. The supplier function is returning a new StringBuilder object in every call. The accumulator function is appending the list string element to the StringBuilder instance.
collect() is one of the Java 8's Stream API's terminal methods. It allows us to perform mutable fold operations (repackaging elements to some data structures and applying some additional logic, concatenating them, etc.) on data elements held in a Stream instance.
The collect() method in Stream API collects all objects from a stream object and stored in the type of collection. The user has to provide what type of collection the results can be stored. We specify the collection type using the Collectors Enum.
The enum
implementation of hashCode
uses the default one supplied by Object
. The documentation of that method mentions:
Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.
Since the hash code is what determines the order of the buckets inside a HashMap
(which is what groupingBy
uses), the order changes when the hash code changes. How this hash code is generate is an implementation detail of the VM (as pointed out by Eugene). By commenting, and un-commenting the line with peek
, you have simply found a way to influence (reliably or not) this implementation.
Since this question got a bounty, it seems like people are not satisfied with my answer. I will go a little bit more in depth and look at the open-jdk8 implementation (because it's open source) of hashCode
. DISCLAIMER: I will state again that the implementation of the identity hash code algorithm is not specified, and can be completely different for a different VM or between different versions of the same VM. Since OP is observing this behaviour, I'll assume that the VM he is using is Hotspot (the Oracle one, which afaik uses the same hashcode implementation as opendjk). But the main point in doing this is to show that commenting or un-commenting a seemingly unrelated line of code can change the order of buckets in a HashMap
. This is also one of the reasons why you should never rely on the iteration order of a collection that does not specify one (like HashMap
).
Now, the actual hashing algorithm for openjdk8 is defined in synchronizer.cpp
:
// Marsaglia's xor-shift scheme with thread-specific state
// This is probably the best overall implementation -- we'll
// likely make this the default in future releases.
unsigned t = Self->_hashStateX ;
t ^= (t << 11) ;
Self->_hashStateX = Self->_hashStateY ;
Self->_hashStateY = Self->_hashStateZ ;
Self->_hashStateZ = Self->_hashStateW ;
unsigned v = Self->_hashStateW ;
v = (v ^ (v >> 19)) ^ (t ^ (t >> 8)) ;
Self->_hashStateW = v ;
value = v ;
As you can see, the hash code is based on these _hashState
fields of the Thread
object, and the output changes from one call to the next, since the variable values are 'shuffled'.
These variables are initialized in the Thread
constructor like so:
_hashStateX = os::random() ;
_hashStateY = 842502087 ;
_hashStateZ = 0x8767 ; // (int)(3579807591LL & 0xffff) ;
_hashStateW = 273326509 ;
The only moving part here is os::random
, which is defined in os.cpp
which has a comment describing the algorithm as:
next_rand = (16807*seed) mod (2**31-1)
This seed
is the only moving part, and it is defined by _rand_seed
, and initialized through a function called init_random
, and at the end of the function, the returned value is used as a seed for the next call. A grep
through the repo shows this:
PS $> grep -r init_random
os/bsd/vm/os_bsd.cpp: init_random(1234567);
os/linux/vm/os_linux.cpp: init_random(1234567);
os/solaris/vm/os_solaris.cpp: init_random(1234567);
os/windows/vm/os_windows.cpp: init_random(1234567);
... test methods
It looks like the initial seed is a constant on the platform I'm testing (windows).
From that, I come to the conclusion that a generated identity hash code (in openjdk-8), changes based on how many identity hash codes have been generated on the same thread before it and by how many times os::random
was called before the thread generating the hash code was instantiated, which stays the same for the example program. We can already kind of see this because the order of the keys doesn't change from run to run of the program, if the program stays the same. But another way to see it, is to put System.out.println(new Object().hashCode());
at the beginning of the main
method, and see that the output is always the same if you run the program a few times.
You will also notice that generating identity hash codes before the stream calls will also change the hash codes of the enum constants, and thus can change the order of the buckets in the map.
Now, let's go back to the Java example. If the identity hash code of an enum constant changes based on how many identity hash codes have been generated before it, the logical conclusion would be that somewhere in the call to peek
, an identity hash code is generated, which changes the hash codes that are generated for the enum constant on the line with collect
afterwards:
Map<Continent, List<String>> regionNames = couList
.stream()
//.peek(System.out::println) // Does this call Object.hashCode?
.collect(Collectors.groupingBy(Country::getRegion,
Collectors.mapping(Country::getName, Collectors.toList()))); // hash code for constant generated here
You can see this with an ordinary Java debugger. I've placed a breakpoint on Object#hashCode
, and waited to see if the line with peek
calls it. (If you try this yourself, I would note that the VM uses HashMap
itself, and will call hashCode
a number of times before the main
method. So be aware of that)
Et voila:
Object.hashCode() line: not available [native method]
HashMap<K,V>.hash(Object) line: 338
HashMap<K,V>.put(K, V) line: 611
HashSet<E>.add(E) line: 219
Collections$SynchronizedSet<E>(Collections$SynchronizedCollection<E>).add(E) line: 2035
Launcher$AppClassLoader(ClassLoader).checkPackageAccess(Class<?>, ProtectionDomain) line: 508
Main.main(String...) line: 19
The line with peek
calls hashCode
on a ProtectionDomain
object that's used by the class loader that loads the LambdaMetafactory
class (which is the Class<?>
that you see, I can get the value from my debugger). The hashCode
method is actually called a bunch of times (maybe a few hundred?), for the line with peek
, throughout the MethodHandle framework.
So, since the line with peek
, calls Object#hashCode
, before the hash codes for the enum constants are generated (also by calling Object#hashCode
), the hash codes of the constants change. So adding or removing the line with peek
changes the hash codes of the constants, which changes the order of the buckets in the map.
One final way to confirm, is to generate the hash codes of the constants before the line with peek
, by adding:
Continent.ASIA.hashCode();
Continent.EUROPE.hashCode();
To the beginning of the main
method.
You will now see that commenting or un-commenting the line with peek
has no effect on the order of the buckets.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With