 

Is there any compiler barrier which is equal to asm("" ::: "memory") in C++11?

My test code is below. I found that only memory_order_seq_cst prevented the compiler from reordering.

#include <atomic>

using namespace std;

int A, B = 1;

void func(void) {
    A = B + 1;
    atomic_thread_fence(memory_order_seq_cst);
    B = 0;
}

Other choices, such as memory_order_release and memory_order_acq_rel, did not generate any compiler barrier at all.

I think they must be used with an atomic variable, as below.

#include <atomic>

using namespace std;

atomic<int> A(0);
int B = 1;

void func(void) {
    A.store(B+1, memory_order_release);
    B = 0;
}

But I do not want to use an atomic variable. At the same time, I think asm("" ::: "memory") is too low-level.

Is there any better choice?

lxyscls asked Nov 13 '16 22:11



1 Answer

re: your edit:

But I do not want to use an atomic variable.

Why not? If it's for performance reasons, use them with memory_order_relaxed and atomic_signal_fence(mo_whatever) to block compiler reordering. That has no runtime overhead other than the compiler barrier potentially blocking some compile-time optimizations, depending on the surrounding code.
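As a minimal sketch of that suggestion, here is the question's func() rewritten with relaxed atomics and a compiler-only fence. The choice of memory_order_release for the fence is an assumption for illustration; pick whatever strength your ordering requirement needs.

```cpp
#include <atomic>

// Both variables are atomic<>, so concurrent access is not a data race.
std::atomic<int> A{0};
std::atomic<int> B{1};

void func() {
    // Relaxed operations need no barrier instructions.
    int b = B.load(std::memory_order_relaxed);
    A.store(b + 1, std::memory_order_relaxed);

    // Compile-time-only barrier: emits no instruction, but the compiler
    // must not reorder the earlier load/store past the later store to B.
    std::atomic_signal_fence(std::memory_order_release);

    B.store(0, std::memory_order_relaxed);
}
```

Note that this only orders the operations as seen by a signal handler running in the same thread. To order them as seen by other threads, you would need atomic_thread_fence or release/acquire operations instead.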

If it's for some other reason, then maybe atomic_signal_fence will give you code that happens to work on your target platform. I suspect that most implementations do order non-atomic<> loads and stores around it in practice, at least as an implementation detail; that is probably effectively required if there are accesses to atomic<> variables. So it might help in practice to avoid some actual consequences of the data-race Undefined Behaviour, which would still exist. (One example is a SeqLock implementation where, for efficiency, you want non-atomic reads / writes of the shared data so the compiler can use SIMD vector copies.)
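For reference, a SeqLock can be sketched without any data race by making the payload relaxed atomics. This deliberately swaps out the non-atomic payload mentioned above (giving up the SIMD-copy benefit) so that the example stays well-defined. The SeqLock type, its members, and the single-writer restriction are all assumptions made for illustration.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical single-writer SeqLock protecting a two-word payload.
struct SeqLock {
    std::atomic<std::uint32_t> seq{0};  // odd while a write is in progress
    std::atomic<int> x{0};
    std::atomic<int> y{0};

    // Assumes exactly one writer thread.
    void write(int nx, int ny) {
        std::uint32_t s = seq.load(std::memory_order_relaxed);
        seq.store(s + 1, std::memory_order_relaxed);          // mark "write in progress"
        std::atomic_thread_fence(std::memory_order_release);  // odd seq ordered before payload stores
        x.store(nx, std::memory_order_relaxed);
        y.store(ny, std::memory_order_relaxed);
        seq.store(s + 2, std::memory_order_release);          // publish the new payload
    }

    // Returns true and fills the outputs only if a consistent snapshot was read.
    bool try_read(int& ox, int& oy) const {
        std::uint32_t s1 = seq.load(std::memory_order_acquire);
        if (s1 & 1) return false;                              // writer is mid-update
        int tx = x.load(std::memory_order_relaxed);
        int ty = y.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);  // payload loads ordered before re-check
        std::uint32_t s2 = seq.load(std::memory_order_relaxed);
        if (s1 != s2) return false;                            // a write overlapped the read
        ox = tx;
        oy = ty;
        return true;
    }
};
```

A reader would typically call try_read in a retry loop until it returns true.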

See Who's afraid of a big bad optimizing compiler? on LWN for some details about the badness you can run into (like invented loads) if you only use compiler barriers to force reloads of non-atomic variables, instead of using something with read-exactly-once semantics. (In that article, they're talking about Linux kernel code so they're using volatile for hand-rolled load/store atomics. But in general don't do that: When to use volatile with multi threading? - pretty much never)


Sufficient for what?

Regardless of any barriers, if two threads run this function at the same time, your program has Undefined Behaviour because of concurrent access to non-atomic<> variables. So the only way this code can be useful is if you're talking about synchronizing with a signal handler that runs in the same thread.

That would also be consistent with asking for a "compiler barrier" that only prevents reordering at compile time, because out-of-order execution and memory reordering always preserve the behaviour of a single thread. So you never need extra barrier instructions to make sure you see your own operations in program order; you just need to stop the compiler from reordering things at compile time. See Jeff Preshing's post: Memory Ordering at Compile Time

This is what atomic_signal_fence is for. You can use it with any std::memory_order, just like thread_fence, to get different strengths of barrier and only prevent the optimizations you need to prevent.
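A rough sketch of that use case: code publishes a value to a signal handler that runs in the same thread. The names (payload, ready, seen, on_signal, publish_and_signal) are hypothetical, and the sketch uses relaxed lock-free atomics rather than plain variables so that the handler's accesses are well-defined. std::raise is used only so the handler runs synchronously for demonstration.

```cpp
#include <atomic>
#include <csignal>

std::atomic<int>  payload{0};
std::atomic<bool> ready{false};
std::atomic<int>  seen{-1};   // what the handler observed, -1 if nothing

extern "C" void on_signal(int) {
    if (ready.load(std::memory_order_relaxed)) {
        // Compile-time only: keeps the payload load after the flag load.
        std::atomic_signal_fence(std::memory_order_acquire);
        seen.store(payload.load(std::memory_order_relaxed),
                   std::memory_order_relaxed);
    }
}

int publish_and_signal() {
    std::signal(SIGINT, on_signal);

    payload.store(42, std::memory_order_relaxed);
    // Compile-time only: keeps the payload store before the flag store.
    std::atomic_signal_fence(std::memory_order_release);
    ready.store(true, std::memory_order_relaxed);

    std::raise(SIGINT);               // handler runs in this thread before raise returns
    std::signal(SIGINT, SIG_DFL);     // restore the default disposition
    return seen.load(std::memory_order_relaxed);
}
```

Neither fence compiles to an instruction; they only constrain what the compiler may do with the surrounding loads and stores.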


... atomic_thread_fence(memory_order_acq_rel) did not generate any compiler barrier at all!

Totally wrong, in several ways.

atomic_thread_fence is a compiler barrier plus whatever run-time barriers are necessary to restrict the order in which our loads/stores become visible to other threads.

I'm guessing you mean it didn't emit any barrier instructions when you looked at the asm output for x86. Instructions like x86's MFENCE are not "compiler barriers", they're run-time memory barriers and prevent even StoreLoad reordering at run-time. (That's the only reordering that x86 allows. SFENCE and LFENCE are only needed when using weakly-ordered (NT) stores, like MOVNTPS (_mm_stream_ps).)

On a weakly-ordered ISA like ARM, thread_fence(mo_acq_rel) isn't free, and compiles to an instruction. gcc5.4 uses dmb ish. (See it on the Godbolt compiler explorer).

A compiler barrier just prevents reordering at compile time, without necessarily preventing run-time reordering. So even on ARM, atomic_signal_fence(mo_seq_cst) compiles to no instructions.
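To compare the two kinds of fence yourself, something like the following could be pasted into a compiler explorer for both x86 and ARM targets. The variable and function names are made up for illustration, and the comments about generated instructions restate the observations above rather than guarantee anything for every compiler version.

```cpp
#include <atomic>

std::atomic<int> X{0};
std::atomic<int> Y{0};

void with_thread_fence() {
    X.store(1, std::memory_order_relaxed);
    // Compiler barrier plus any needed run-time barrier:
    // no instruction on x86, but dmb ish with gcc on ARM.
    std::atomic_thread_fence(std::memory_order_acq_rel);
    Y.store(1, std::memory_order_relaxed);
}

void with_signal_fence() {
    X.store(2, std::memory_order_relaxed);
    // Compiler barrier only: no instruction on either ISA,
    // even with the strongest memory order.
    std::atomic_signal_fence(std::memory_order_seq_cst);
    Y.store(2, std::memory_order_relaxed);
}
```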

A weak enough barrier allows the compiler to do the store to B ahead of the store to A if it wants, but gcc happens to decide to still do them in source order even with thread_fence(mo_acquire) (which shouldn't order stores with other stores).

So this example doesn't really test whether something is a compiler barrier or not.


Strange compiler behaviour from gcc for an example that is different with a compiler barrier:

See this source+asm on Godbolt.

#include <atomic>
using namespace std;
int A,B;

void foo() {
  A = 0;
  atomic_thread_fence(memory_order_release);
  B = 1;
  //asm volatile(""::: "memory");
  //atomic_signal_fence(memory_order_release);
  atomic_thread_fence(memory_order_release);
  A = 2;
}

This compiles with clang the way you'd expect: the thread_fence is a StoreStore barrier, so the A=0 has to happen before B=1, and can't be merged with the A=2.

    # clang3.9 -O3
    mov     dword ptr [rip + A], 0
    mov     dword ptr [rip + B], 1
    mov     dword ptr [rip + A], 2
    ret

But with gcc, the barrier has no effect, and only the final store to A is present in the asm output.

    # gcc6.2 -O3
    mov     DWORD PTR B[rip], 1
    mov     DWORD PTR A[rip], 2
    ret

But with atomic_signal_fence(memory_order_release), gcc's output matches clang. So atomic_signal_fence(mo_release) is having the barrier effect we expect, but atomic_thread_fence with anything weaker than seq_cst isn't acting as a compiler barrier at all.

One theory here is that gcc knows that it's officially Undefined Behaviour for multiple threads to write to non-atomic<> variables. This doesn't hold much water, because atomic_thread_fence should still work if used to synchronize with a signal handler, it's just stronger than necessary.

BTW, with atomic_thread_fence(memory_order_seq_cst), we get the expected output:

    # gcc6.2 -O3, with a mo_seq_cst barrier
    mov     DWORD PTR A[rip], 0
    mov     DWORD PTR B[rip], 1
    mfence
    mov     DWORD PTR A[rip], 2
    ret

We get all three stores even with only one barrier, although a single barrier would still allow the A=0 and A=2 stores to happen one after the other, so the compiler would be allowed to merge them across it. (Observers failing to see separate A=0 and A=2 values is a possible ordering, so the compiler can decide that's what always happens.) Current compilers don't usually do this kind of optimization, though. See the discussion at the end of my answer on Can num++ be atomic for 'int num'?.

Peter Cordes answered Nov 16 '22 00:11
