Least intrusive compile barrier for Java on x86

Question

If I hava a Java process interacting with some other process via a shared ByteBuffer or similar, what would be the least intrusive equivalent of a compiler barrier in C/C++? No portability is required - I am specifically interested in x86.

For example I have 2 processes reading and writing to an area of memory as per the pseudocode:

p1:
    i = 0
    while true:
      A = 0
      //Write to B
      A = ++i

p2:
    a1 = A
    //Read from B
    a2 = A

    if a1 == a2 and a1 != 0:
      //Read was valid

Due to the strict memory ordering on x86 (loads to separate locations not reorder and reads to separate locations not reordered) this does not require any memory barrier in C++, just a compile barrier between each write and between each read (i.e. asm volatile).

How can I achieve the same ordering constraint in Java in the least expensive manner. Is there anything less intrusive than writing to a volatile?

BeeOnRope · Accepted Answer

sun.misc.Unsafe.putOrdered should do what you want - a store with the lock implied on x86 by volatile. The compiler will not move instructions around it, I believe.

This is the same as lazySet on AtomicInteger and friends, but that can't be used directly with ByteBuffer.

Unlike volatile or the AtomicThings classes, that method applies to the specific writes you use it on, and not the definition of the member, so using it doesn't imply anything for reads.

It looks like you are trying to implement something like a seqlock - meaning you need to avoid re-ordering between reads of the version counter, A, and the reads/writes of the data itself. A plain int isn't going to cut it - since the JIT might do all sorts of naughty things. My recommendation would be to use a volatile int for your counter, but then write it to it with putOrdered. This way, you don't pay the price for volatile writes (a dozen cycles or more, usually), while getting the compiler barrier implied by the volatile read (and the hardware barrier for those reads is a no-op, making them fast).

All that said, I think you are in a grey area here, because lazySet isn't a part of the formal memory model, and doesn't fit cleanly into the happens-before reasoning, so you need a deeper understanding of the actual JIT and hardware implementation to see if you can combine things in this way.

Finally, even with volatile reads and writes (ignoring lazySet), I don't think your seqlock is sound from point of view of the java memory model, because volatile writes only set up a happens-before between that write and later reads on another thread, and earlier actions in the writing thread, but not between the read and actions following the write on the writing thread. Said another way, it is a unidirectional fence, not a bidirectional one. I believe writes in version N+1 to your shared region can be seen by the reading thread even while it reads A == N twice.

Clarification from the comment:

Volatile only sets up a one way barrier. It is very similar to acquire/release semantics used by WinTel in some APIs. For example, assume A, Bv, and C all initially zero:

Thread 1:
A = 1;   // 1
Bv = 1;  // 2
C = 1;   // 3

Thread 2:

int c = C;  // 4
int b = Bv; // 5
int a = A;  // 6

Here, only Bv is volatile. The two threads are doing something similar in concept to your seqlock writers and readers - thread 1 writes some stuff in one order, and thread 2 reads the same stuff in a reverse order, and tries to reason about ordering from that.

If thread two has b == 1, then a == 1 always, because 1 happens-before 2 (program order), and 5 happens before 6 (program order), and most critically 2 happens before 5 since 5 read the value written at 2. So in this way the write and read of Bv is acting like a fence. Things above (2) cannot "move below" (2), and things below (5) cannot "move above" 5. Note I only restricted movement in one directly for each thread, however, not both, which brings us to our next example:

Equivalently to the above, you might assume that if a == 0, then c == 0 also, since C is written after a, and read before. However, volatiles don't guarantee this. In particular, the happens-before reasoning above doesn't prevent (3) from being moved above (2) as observed by thread 2, nor do they prevent (4) from being pushed below (5).

Update:

Let's look at your example specifically.

What I believe can happen is this, unrolling the write loop which occurs in p1.

p1:

i = 0
A = 0
// (p1-1) write data1 to B
A = ++i;  // (p1-2) 1 assigned to A

A=0  // (p1-3)
// (p1-4) write data2 to B
A = ++i;  // (p1-5) 2 assigned to A

p2:

a1 = A // (p2-1)
//Read from B // (p2-2)
a2 = A // (p2-3)

if a1 == a2 and a1 != 0:

Let's say p2 sees 1 for a1 and a2. This means there is a happens before between p2-1 and p1-2 (and by extension p1-1), and also between p2-3 and p1-2. However there is happens-before between anything in p2 and p1-4. So in fact, I believe the read of B at p2-2 can observe the second (perhaps partially completed) read at p1-4, which can "move above" the volatile writes at p1-2 and p1-3.

It's interesting enough that I think you might make a new question just on that alone - forget about faster barriers - does this work at all even with volatile?

Peter Lawrey · Answer

You can use lazySet, it can be up to 10x faster than setting a volatile field as it doesn't stall the CPU pipeline. e.g. AtomicLong lazySet or you can use the Unsafe equivalent if you need.

Least intrusive compile barrier for Java on x86

Tags:

java

performance

memory

x86

barrier

jmetcalfe

2 Answers

BeeOnRope

Peter Lawrey

Recent Activity

Donate For Us

Least intrusive compile barrier for Java on x86

Tags:

java

performance

memory

x86

barrier

jmetcalfe

2 Answers

BeeOnRope

Peter Lawrey

Related questions

Recent Activity

Donate For Us