Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

JDK9 randomization on immutable sets and maps

Reading this question and the answer given by Eugene, I found that JDK9 immutable sets and maps will introduce a source of randomness that will affect their traversal. This means that iteration order will indeed be random, at least among different runs of the JVM.

As the spec doesn't guarantee any traversal/iteration order for sets and maps, this is absolutely fine. In fact, code must never rely on implementation-specific details, but on the spec instead.

I know that today, with JDK 8, if I have i.e. a HashSet and do this (taken from the linked answer):

Set<String> wordSet = new HashSet<>(Arrays.asList("just", "a", "test"));

System.out.println(wordSet);

for (int i = 0; i < 100; i++) {
    wordSet.add("" + i);
}

for (int i = 0; i < 100; i++) {
    wordSet.remove("" + i);
}

System.out.println(wordSet);

Then the iteration order of the elements will change and the two outputs will differ. This is because adding and removing 100 elements to the set changes the internal capacity of the HashSet and rehashes elements. And this is perfectly valid behavior. I'm not asking about this here.

However, with JDK9, if I do this:

Set<String> set = Set.of("just", "a", "test");
System.out.println(set);

And then, in another instance of the JVM, I run the same code, the outputs can be different, because randomization has been introduced.

So far, I've found this excellent video in youtube (minute 44:55), in which Stuart Marks says that one motivation for this randomization is:

(...) that people write applications that have inadvertent dependencies on iteration order. (...) So, anyway, iteration order is a big deal and I think there's a lot of code out there that has latent dependencies on iteration order that has not been discovered yet. (...) So, our response to this is to deliberately randomize the iteration order in Set and Map in the new collections. So whereas before the iteration order of collections was unpredictable but stable, these are predictably unpredictable. So every time the JVM starts up, we get a random number and we use that at as a seed value that gets mixed in with the hash values. So, if you run a program that initializes a set and then prints out the elements in any order, you get an answer, and then, if you invoke the JVM again and run that same program, the set of elements usually would come out in a different order. So, the idea here is that (...) if there are iteration order dependencies in your code, what used to happen in the past, is a new JDK release came out and you test your code and (...) it'd take hours of debugging to trace it down to some kind of change in iteration order. What that meant was there was a bug in that code that depended on the iteration order. Now, if you vary the iteration order more often, like every JVM invocation, then (we hope) that weird behavior will manifest itself more frequently, and in fact we hope while you're doing testing...

So, the motivation is clear, and it's also clear that this randomization will only affect the new immutable sets and maps.

My question is: Are there any other motivations for this randomization and what advantages does it have?

like image 663
fps Avatar asked Jul 20 '17 18:07

fps


1 Answers

Well it turns out there is another reason for the randomized iteration order. It's not a big secret or anything. I thought I had explained it in that talk, but maybe not. I probably mentioned it on the OpenJDK mailing lists or perhaps in internal discussions.

In any case, another reason for randomized iteration order is to preserve flexibility for future implementation changes.

This turns out to be a bigger deal than most people think. Historically, HashSet and HashMap have never specified a particular iteration order. From time to time, however, the implementation needed to change, to improve performance or to fix bugs. Any change to iteration order generated a lot of flak from users. Over the years, a lot of resistance built up to changing iteration order, and this made maintenance of HashMap more difficult.

To see why this is a problem, consider a spectrum of different policies for managing the stability of iteration order:

  1. Specify the iteration order, and stick to it.

  2. Leave iteration order unspecified, but implicitly keep iteration order stable.

  3. Leave iteration order unspecified, but change the iteration order as little as possible.

  4. Change the iteration order frequently, e.g., in update releases.

  5. Change the iteration order more frequently, e.g., from one run of the JVM to the next.

  6. Change the iteration order even more frequently, e.g., from one iteration to the next.

When collections were introduced in JDK 1.2, HashMap iteration order was unspecified. Stable iteration order was provided by LinkedHashMap at a somewhat higher cost. If you didn't need a stable iteration order, you shouldn't have to pay for it. This ruled out #1 and #2.

For the next several releases, we tried to keep iteration order stable, even though the specification allowed it to change. Nobody likes it when code breaks, and it's pretty unpleasant to have to tell a customer that his code is broken because it depends on iteration order.

So we ended up with policy #3, keeping iteration order as stable as possible, although it did change from time to time. For example, we introduced alternative hashing in JDK 7u6 (code review for JDK-7118743) and tree bins in JDK 8 (JEP 180), and both changed HashMap iteration order in some circumstances. Ordering also changed a couple times in earlier releases. Somebody did some archaeology and found that iteration order changed an average of once per major JDK release.

This was the worst of all possible worlds. Major releases only occurred once every couple years. When one came out, everybody's code would break. There would be much wailing and gnashing of teeth, people would fix their code, and we'd promise to never, ever change iteration order again. A couple years would go by and new code would be written that inadvertently depended on iteration order. Then we'd come out with another major release that changed the iteration order, and this would break everybody's code again. And the cycle would begin anew.

I wanted to avoid repeating this cycle for the new collections. Instead of keeping iteration order as stable as possible, I pursued a policy of changing it as frequently as possible. Initially the order changed on every iteration, but this imposed some overhead. Eventually we settled on once per JVM invocation. The cost is a 32-bit XOR operation per table probe, which I think is pretty cheap.

To a certain extent this is about "toughening up" application code. If changing iteration order breaks code, then breaking that code more frequently will cause it to develop resistance that kind of breakage. Of course, the code doesn't get stronger by itself; it requires more developer effort for this to happen. And people will quite reasonably complain about having to do this additional work.

But the "toughening" of application code is in some sense secondary to the other goal of preserving the freedom to change the implementation. Preserving iteration order of HashMap has made it more difficult to maintain. Randomized iteration order in the new collections means that we needn't worry about preserving iteration order when modifying them, so they're easier to maintain and enhance.

For example, the current implementation (Java 9, pre-GA, July 2017) has three field-based implementations of Set (Set0, Set1, and Set2) and an array-based implementation (SetN) that uses a simple closed hashing with linear probing scheme. In the future, we might want to add a Set3 implementation that holds three elements in three fields. Or, we might want to change the collision resolution policy of SetN from linear probing to something more sophisticated. We can completely restructure the implementation, even in minor releases, if we don't have to deal with preserving iteration order.

In summary, the tradeoff is that application developers have to do more work to make sure their code resists breakage from iteration order change. This is likely work they'd have to do at some point anyway with HashMap. What's gained by this is more opportunities for the JDK to deliver improved performance and space efficiency, from which everybody can benefit.

like image 110
Stuart Marks Avatar answered Sep 28 '22 03:09

Stuart Marks