This is what I think is happening (keep in mind I'm not familiar with HotSpot):
Without the bar = vfoo assignment:
0xf36c9fd0: mov 0x6c(%ecx),%ebp ; vfoo
0xf36c9fd3: test %ebp,%ebp ; vfoo is null?
0xf36c9fd5: je 0xf36c9ff7 ; throw NullPointerException (I guess)
0xf36c9fd7: movl $0x1,0x8(%ebp) ; vfoo.x = 1
0xf36c9fde: mov 0x68(%ecx),%ebp ; sz
0xf36c9fe1: inc %ebx ; i++
0xf36c9fe2: test %edi,0xf7725000 ; safepoint on end of loop
0xf36c9fe8: cmp %ebp,%ebx ; i < sz?
0xf36c9fea: jl 0xf36c9fd0
With the bar = vfoo assignment:
0xf3771ad0: mov 0x6c(%ecx),%ebp ; vfoo
0xf3771ad3: test %ebp,%ebp ; vfoo is null?
0xf3771ad5: je 0xf3771b09 ; throw NullPointerException (I guess)
0xf3771ad7: movl $0x1,0x8(%ebp) ; vfoo.x = 1
0xf3771ade: mov 0x6c(%ecx),%ebp ; \
0xf3771ae1: mov %ebp,0x70(%ecx) ; / bar = vfoo
0xf3771ae4: mov 0x68(%ecx),%edi ; sz
0xf3771ae7: inc %ebx ; i++
0xf3771ae8: mov %ecx,%eax ;
0xf3771aea: shr $0x9,%eax ; ??? \ Probably replaced later
0xf3771aed: movb $0x0,-0x3113c300(%eax) ; ??? / by some barrier code?
0xf3771af4: test %edi,0xf77ce000 ; safepoint
0xf3771afa: cmp %edi,%ebx ; i < sz ?
0xf3771afc: jl 0xf3771ad0 ;
The reason I think the above code stands in for a barrier is that, when taking the NullPointerException, the scalable version has an XCHG, which acts as a barrier, while the non-scalable version has a NOP there. The rationale would be that there needs to be a happens-before ordering between the initial load of vfoo and joining the thread. In the volatile case, the barrier would be inside the loop, so it wouldn't need to be elsewhere. What I don't understand is why XCHG isn't used inside the loop. Maybe runtime detection of MFENCE support?
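As background, the happens-before edge in question is the one a volatile write/read pair establishes. Here is a minimal, self-contained sketch (class and field names are mine, not from the benchmark): a volatile write in one thread makes all earlier plain writes visible to any thread that subsequently reads the same volatile.

```java
public class HappensBeforeSketch {
    static int data;               // plain field
    static volatile boolean ready; // volatile field: the write/read pair creates the edge

    public static void main(String[] args) {
        Thread writer = new Thread(new Runnable() {
            public void run() {
                data = 42;    // plain write...
                ready = true; // ...published by the volatile write (the barrier sits here)
            }
        });
        writer.start();
        while (!ready) { }        // volatile read; spin until the flag is published
        System.out.println(data); // once ready is seen as true, data == 42 is guaranteed
    }
}
```

Without the volatile modifier on ready, the spin loop could legally run forever or print a stale value of data.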
Let's try to get the JVM to behave a bit more "consistently." The JIT compiler is really throwing off comparisons of test runs, so let's disable it with -Djava.compiler=NONE. This definitely introduces a performance hit, but it helps eliminate the obscuring effects of JIT optimizations.
Garbage collection introduces its own set of complexities. Let's use the serial garbage collector with -XX:+UseSerialGC, disable explicit garbage collections, and turn on some logging to see when collection is performed: -verbose:gc -XX:+DisableExplicitGC. Finally, let's allocate enough heap up front with -Xmx128m -Xms128m.
Now we can run the test using:
java -XX:+UseSerialGC -verbose:gc -XX:+DisableExplicitGC -Djava.compiler=NONE -Xmx128m -Xms128m -server -Dsize=50000000 -Dpar=1 MultiVolatileJavaExperiment 10
Running the test multiple times shows the results are very consistent (I'm using Oracle Java 1.6.0_24-b07 on Ubuntu 10.04.3 LTS with an Intel(R) Core(TM)2 Duo CPU P8700 @ 2.53GHz), averaging about 2050 milliseconds. If I comment out the bar = vfoo line, I consistently average about 1280 milliseconds. Running the test with -Dpar=2 results in an average of about 1350 milliseconds with bar = vfoo and about 1005 milliseconds with it commented out.
+=========+======+=========+
| Threads | With | Without |
+=========+======+=========+
| 1 | 2050 | 1280 |
+---------+------+---------+
| 2 | 1350 | 1005 |
+=========+======+=========+
Let's now look at the code and see if we can spot any reasons why multi-threading is inefficient. In Reader.run(), qualifying variables with this as appropriate helps make clear which variables are local:
int i = 0;
while (i < this.sz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}
The first thing to notice is that the while loop contains four field accesses through this. This means the code is accessing the class's runtime constant pool and performing type-checking (via the getfield bytecode instruction). Let's change the code to try to eliminate accessing the runtime constant pool and see if we get any benefits.
final int mysz = this.sz;
int i = 0;
while (i < mysz) {
    this.vfoo.x = 1;
    this.bar = this.vfoo;
    i++;
}
Here, we're using a local mysz variable to hold the loop bound, accessing sz through this only once, for initialization. Running the test with two threads averages about 1295 milliseconds; a small benefit, but one nonetheless.
Looking at the while loop, do we really need to reference this.vfoo twice? The two volatile reads create two synchronization edges that the virtual machine (and the underlying hardware, for that matter) needs to manage. If one synchronization edge at the beginning of each iteration is enough, we can use the following:
final int mysz = this.sz;
Foo myvfoo = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    this.bar = myvfoo;
    i++;
}
This averages about 1122 milliseconds; still getting better. What about that this.bar reference? Since we are talking multi-threading, let's say the calculations in the while loop are what we want the multi-threaded benefit from, and this.bar is how we communicate our results to others. We really don't want to set this.bar until after the while loop is done.
final int mysz = this.sz;
Foo myvfoo = null;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo = this.vfoo;
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
this.bar = mybar;
Which gives us about 857 milliseconds on average. There's still that final this.vfoo reference in the while loop. Assuming again that the while loop is what we want the multi-threaded benefit from, let's move that this.vfoo access out of the loop.
final int mysz = this.sz;
final Foo myvfoo = this.vfoo;
Foo mybar = null;
int i = 0;
while (i < mysz) {
    myvfoo.x = 1;
    mybar = myvfoo;
    i++;
}
final Foo vfoocheck = this.vfoo;
if (vfoocheck != myvfoo) {
    System.out.println("vfoo changed from " + myvfoo + " to " + vfoocheck);
}
this.bar = mybar;
Now we average about 502 milliseconds; the single-threaded test averages about 900 milliseconds. So what does this tell us? By hoisting non-local variable references out of the while loop, we've seen significant performance benefits in both the single- and double-threaded tests. The original version of MultiVolatileJavaExperiment was measuring the cost of accessing non-local variables 50,000,000 times, while the final version is measuring the cost of accessing local variables 50,000,000 times. By using local variables, you increase the likelihood that the Java Virtual Machine and the underlying hardware can manage the thread caches more efficiently.
Finally, let's run the tests normally (notice the 500,000,000 loop size instead of 50,000,000):
java -Xmx128m -Xms128m -server -Dsize=500000000 -Dpar=2 MultiVolatileJavaExperiment 10
The original version averages about 1100 milliseconds and the modified version averages about 10 milliseconds.
You are not actually writing to a volatile field, so the volatile value can be cached in each thread. Using volatile prevents some compiler optimisations, and in a micro-benchmark you can see a large relative difference. In the example above, the commented-out version is longer because its loop has been unrolled to place two iterations in one actual loop. This can almost double performance. When using volatile, you can see there is no loop unrolling.
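The effect can be sketched with a hypothetical micro-benchmark (class and field names are mine, not the original code; absolute timings will vary by machine and JVM): the same loop over a plain field typically runs faster than over a volatile one, because the JIT is free to unroll and coalesce the plain stores but must keep every volatile store ordered.

```java
public class UnrollSketch {
    int plain;        // plain field: stores may be unrolled/coalesced by the JIT
    volatile int vol; // volatile field: every store must be kept and ordered

    long plainLoop(int n) {
        long t = System.nanoTime();
        for (int i = 0; i < n; i++) plain = i;
        return System.nanoTime() - t;
    }

    long volatileLoop(int n) {
        long t = System.nanoTime();
        for (int i = 0; i < n; i++) vol = i;
        return System.nanoTime() - t;
    }

    public static void main(String[] args) {
        UnrollSketch s = new UnrollSketch();
        int n = 10000000;
        System.out.println("plain:    " + s.plainLoop(n) + " ns");
        System.out.println("volatile: " + s.volatileLoop(n) + " ns");
        // sanity check that both loops actually ran to completion
        System.out.println(s.plain == n - 1 && s.vol == n - 1 ? "ok" : "mismatch");
    }
}
```

On most machines the volatile loop is noticeably slower; the exact ratio depends on the JIT and the CPU's store pipeline.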
BTW: You can remove a lot of the code in your example to make it easier to read. ;)
Edit: This answer did not stand up to testing.
I have no way to test this right now (no multicore CPU in this machine), but here is a theory: the Foo instances might not be in the same cache lines, but perhaps the Reader instances are. This means the slowdown could be explained by the write to bar rather than the read of vfoo, because writing to bar would invalidate that cache line for the other core and cause lots of copying between caches. Commenting out the write to bar (which is the only write to a field of Reader in the loop) stops the slowdown, which is consistent with this explanation.
Edit: According to this article, the memory layout of objects is such that the bar reference would be the last field in the layout of the Reader object. This means it is likely to land in the same cache line as the next object on the heap. Since I am not sure about the order in which new objects are allocated on the heap, I suggested in the comment below padding both "hot" object types with references, which would be effective in separating the objects (at least, I hope it will, but it depends on how fields of the same type are ordered in memory).
Short answer: apparently, it is false sharing due to card marking for the GC. A more extensive explanation is given in this question:
Array allocation and access on the Java Virtual Machine and memory contention
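If card-table false sharing is indeed the cause, HotSpot has a flag that makes card marking conditional (the card byte is written only if it is not already dirty), which is reported to mitigate exactly this kind of contention. Availability depends on the JVM version (it appeared around JDK 7), so treat this invocation as a suggestion to try rather than a guaranteed fix:

```
java -XX:+UseCondCardMark -server -Dsize=50000000 -Dpar=2 MultiVolatileJavaExperiment 10
```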