Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

GCC memory barrier __sync_synchronize vs asm volatile("": : :"memory")

Tags:

c

gcc

People also ask

What is memory barrier in GCC?

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.

Is volatile a memory barrier?

volatile in most programming languages does not imply a real CPU read memory barrier but an order to the compiler not to optimize the reads via caching in a register. This means that the reading process/thread will get the value "eventually".

What is barrier instruction?

The memory barrier instructions halt execution of the application code until a memory write of an instruction has finished executing. They are used to ensure that a critical section of code has been completed before continuing execution of the application code.

What is a compiler barrier?

Complier Barriers. A compiler barrier is a sequence point. At such a point, we want all previous operations to have stored their results to memory, and we want all future operations to not have been started yet. The most common sequence point is a function call.


There's a significant difference - the first option (inline asm) actually does nothing at runtime, there's no command performed there and the CPU doesn't know about it. it only serves at compile time, to tell the compiler not to move loads or stores beyond this point (in any direction) as part of its optimizations. It's called a SW barrier.

The second barrier (builtin sync), would simply translate into a HW barrier, probably a fence (mfence/sfence) operations if you're on x86, or its equivalents in other architectures. The CPU may also do various optimizations at runtime, the most important one is actually performing operations out-of-order - this instruction tells it to make sure that loads or stores can't pass this point and must be observed in the correct side of the sync point.

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barrier that is specific to multi-processor environments. The name of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers and on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.

An example for when SW barrier is useful: consider the following code -

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here's the assembly code gcc 4.8.0 -O3 generated packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>

However, when the CPU performes this code, it's permitted to reorder the operations "under the hood", as long as it does not break memory ordering model. This means that performing the operations can be done out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.


A comment on the usefulness of SW-only barriers:

On some micro-controllers, and other embedded platforms, you may have multitasking, but no cache system or cache latency, and hence no HW barrier instructions. So you need to do things like SW spin-locks. The SW barrier prevents compiler optimizations (read/write combining and reordering) in these algorithms.