I have a multi-threaded JMH test:
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.LockSupport;
import org.openjdk.jmh.annotations.*;
import sun.misc.Unsafe;
@State(Scope.Benchmark)
@BenchmarkMode(Mode.Throughput)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(value = 1, jvmArgsAppend = { "-Xmx512m", "-server", "-XX:+AggressiveOpts","-XX:+UnlockDiagnosticVMOptions",
"-XX:+UnlockExperimentalVMOptions", "-XX:+PrintAssembly", "-XX:PrintAssemblyOptions=intel",
"-XX:+PrintSignatureHandlers"})
@Measurement(iterations = 5, time = 5, timeUnit = TimeUnit.SECONDS)
@Warmup(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
public class LinkedQueueBenchmark {
    private static final Unsafe unsafe = UnsafeProvider.getUnsafe();
    private static final long offsetObject;
    private static final long offsetNext;
    private static final int THREADS = 5;

    private static class Node {
        private volatile Node next;
        public Node() {}
        public Node(Node next) { this.next = next; }
    }

    static {
        try {
            offsetObject = unsafe.objectFieldOffset(LinkedQueueBenchmark.class.getDeclaredField("object"));
            offsetNext = unsafe.objectFieldOffset(Node.class.getDeclaredField("next"));
        } catch (Exception ex) {
            throw new Error(ex);
        }
    }

    protected long t0, t1, t2, t3, t4, t5, t6, t7; // padding

    private volatile Node object = new Node(null);

    @Threads(THREADS)
    @Benchmark
    public Node doTestCasSmart() {
        Node current, o = new Node();
        for (;;) {
            current = this.object;
            if (unsafe.compareAndSwapObject(this, offsetObject, current, o)) {
                //current.next = o; // Special line:
                break;
            } else {
                LockSupport.parkNanos(1);
            }
        }
        return current;
    }
}
As I understand it, something happens with the CPU caches, or maybe the store buffer is being drained. If I replace it with a lock-based method, without CAS, performance is 11-20 ops/us.
I tried using LinuxPerfAsmProfiler and PrintAssembly; in the second case I see:
....[Hottest Regions]...............................................................................
25.92% 17.93% [0x7f1d5105fe60:0x7f1d5105fe69] in SpinPause (libjvm.so)
17.53% 20.62% [0x7f1d5119dd88:0x7f1d5119de57] in ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*) const (libjvm.so)
10.81% 6.30% [0x7f1d5129cff5:0x7f1d5129d0ed] in ParallelTaskTerminator::offer_termination(TerminatorTerminator*) (libjvm.so)
7.99% 9.86% [0x7f1d3c51d280:0x7f1d3c51d3a2] in com.jad.generated.LinkedQueueBenchmark_doTestCasSmart::doTestCasSmart_thrpt_jmhStub
Can someone explain what really happens here? Why is it so slow? Where is the store-load barrier here? Why is putOrdered not working? And how do I fix it?
Rule: Instead of looking for "advanced" answers, you should first look for stupid mistakes.
SpinPause, ParMarkBitMap::live_words_in_range(HeapWord*, oopDesc*), and ParallelTaskTerminator::offer_termination(TerminatorTerminator*) come from GC threads, which may very well mean that most of the work the benchmark does is GC. Indeed, running with the "special line" uncommented and -prof gc yields:
# Run complete. Total time: 00:00:43
Benchmark Mode Cnt Score Error Units
LQB.doTestCasSmart thrpt 5 5.930 ± 3.867 ops/us
LQB.doTestCasSmart:·gc.time thrpt 5 29970.000 ms
Therefore, out of the 43 seconds of the run, you spent 30 seconds doing GC. Even plain -verbose:gc shows it:
Iteration 3: [Full GC (Ergonomics) 408188K->1542K(454656K), 0.0043022 secs]
[GC (Allocation Failure) 60422K->60174K(454656K), 0.2061024 secs]
[GC (Allocation Failure) 119054K->118830K(454656K), 0.2314572 secs]
[GC (Allocation Failure) 177710K->177430K(454656K), 0.2268396 secs]
[GC (Allocation Failure) 236310K->236054K(454656K), 0.1718049 secs]
[GC (Allocation Failure) 294934K->294566K(454656K), 0.2265855 secs]
[Full GC (Ergonomics) 294566K->147408K(466432K), 0.7139546 secs]
[GC (Allocation Failure) 206288K->205880K(466432K), 0.2065388 secs]
[GC (Allocation Failure) 264760K->264312K(466432K), 0.2314117 secs]
[GC (Allocation Failure) 323192K->323016K(466432K), 0.2183271 secs]
[Full GC (Ergonomics) 323016K->322663K(466432K), 2.8058725 secs]
2.8s full GCs: that sucks. Around 5s spent in GC, in an iteration that is bounded by 5s of run time. That sucks too.
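As a quick sanity check (my own arithmetic, not from the original post), summing the pause times reported in the -verbose:gc log above confirms the "around 5s" claim:

```java
// Sum the GC pause times reported in the -verbose:gc snippet above.
public class GcPauseSum {
    public static void main(String[] args) {
        double[] pauses = {
            0.0043022, 0.2061024, 0.2314572, 0.2268396, 0.1718049,
            0.2265855, 0.7139546, 0.2065388, 0.2314117, 0.2183271,
            2.8058725
        };
        double total = 0;
        for (double p : pauses) total += p;
        // Round to two decimals for a stable printout.
        System.out.println(Math.round(total * 100) / 100.0);
    }
}
```

This prints 5.24, i.e. about 5.24 seconds of GC pauses inside a single 5-second iteration.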
Why is that? Well, you are building the linked list there. Granted, the head of the queue is unreachable, and everything from the head up to your object should be collected. But the collection is not instantaneous. The longer the queue gets, the more memory it consumes, and the more work the GC has to do to traverse it. That is a positive feedback loop that cripples the execution. Since the queue elements are collectable anyhow, this feedback loop never reaches OOME. Storing the initial object in a new head field would make the test eventually fail with OOME.
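To illustrate that last point with a hypothetical standalone sketch (the class and field names here are mine, not from the benchmark): once a head reference is retained, every node linked from it stays strongly reachable, so the GC can reclaim none of them and the chain only grows.

```java
// Sketch: retaining a head reference pins the entire linked chain in memory.
public class HeadRetentionDemo {
    static final class Node {
        Node next;
    }

    public static void main(String[] args) {
        Node head = new Node();        // retained root: keeps everything alive
        Node tail = head;
        for (int i = 0; i < 1_000_000; i++) {
            Node n = new Node();
            tail.next = n;             // every node remains reachable from head
            tail = n;
        }
        // Walk the chain: all nodes are still live, none could be collected.
        int count = 0;
        for (Node n = head; n != null; n = n.next) count++;
        System.out.println(count);
    }
}
```

With 1,000,001 live nodes after one million appends, growing such a chain inside a benchmark loop would eventually exhaust the 512m heap instead of merely thrashing the GC.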
Therefore, your question has nothing to do with putOrdered, memory barriers, or queue performance, to be frank. I think you need to reconsider what you actually test. Designing the test so that the transient memory footprint stays the same on each @Benchmark call is an art in itself.
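One hypothetical way to keep the footprint constant (my own sketch, using AtomicReference in place of Unsafe CAS) is to recycle the node you swap out instead of allocating a fresh one per call, so the steady-state allocation rate is zero:

```java
import java.util.concurrent.atomic.AtomicReference;

// Sketch: a CAS swap that recycles nodes, keeping the transient
// footprint constant instead of growing a garbage chain per call.
public class ConstantFootprintDemo {
    static final class Node {
        volatile Node next;
    }

    private final AtomicReference<Node> object = new AtomicReference<>(new Node());

    // Swap in a caller-supplied node; hand the displaced one back for reuse.
    Node swap(Node fresh) {
        for (;;) {
            Node current = object.get();
            if (object.compareAndSet(current, fresh)) {
                current.next = null;   // detach so nothing links back into it
                return current;
            }
        }
    }

    public static void main(String[] args) {
        ConstantFootprintDemo d = new ConstantFootprintDemo();
        Node spare = new Node();
        for (int i = 0; i < 1_000_000; i++) {
            spare = d.swap(spare);     // only two Node instances ever exist
        }
        System.out.println("done");
    }
}
```

In a JMH setting, the recycled node could live in a per-thread @State object; the point is simply that each call frees exactly as much as it allocates, so the GC feedback loop never starts.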