GCC memory barrier __sync_synchronize vs asm volatile("": : :"memory")

People also ask

What is memory barrier in GCC?

In computing, a memory barrier, also known as a membar, memory fence or fence instruction, is a type of barrier instruction that causes a central processing unit (CPU) or compiler to enforce an ordering constraint on memory operations issued before and after the barrier instruction.

Is volatile a memory barrier?

volatile in most programming languages does not imply a real CPU read memory barrier but an order to the compiler not to optimize the reads via caching in a register. This means that the reading process/thread will get the value "eventually".

What is barrier instruction?

The memory barrier instructions halt execution of the application code until a memory write of an instruction has finished executing. They are used to ensure that a critical section of code has been completed before continuing execution of the application code.

What is a compiler barrier?

Compiler Barriers. A compiler barrier is a sequence point. At such a point, we want all previous operations to have stored their results to memory, and we want all future operations to not have been started yet. The most common sequence point is a function call.

There's a significant difference. The first option (the inline asm) does nothing at runtime: no instruction is emitted for it, and the CPU never sees it. It acts only at compile time, telling the compiler not to move loads or stores across this point (in either direction) as part of its optimizations. This is called a SW barrier.

The second barrier (the __sync_synchronize builtin) translates into a HW barrier: typically a full fence such as mfence on x86 (sfence alone would order only stores, so it is not sufficient for a full barrier), or the equivalent instruction on other architectures. The CPU also performs various optimizations at runtime, the most important being out-of-order execution. This instruction tells it that loads and stores may not pass this point and must be observed on the correct side of the sync point.
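As a minimal sketch of the two options side by side (the function names and the compiler_barrier macro are made up for illustration; only the asm statement and __sync_synchronize come from the question):

```c
/* SW barrier: constrains the compiler only, emits no instruction. */
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Hypothetical helper: the compiler may not reorder or merge the two
 * stores across the barrier, but the CPU is not told anything. */
void store_pair_sw(int *first, int *second, int value)
{
    *first = value;
    compiler_barrier();
    *second = value;
}

/* Hypothetical helper: same stores, but a full HW barrier is emitted
 * between them, so the CPU must also keep them ordered. */
void store_pair_hw(int *first, int *second, int value)
{
    *first = value;
    __sync_synchronize();
    *second = value;
}
```

Compiling this with gcc -O2 -S and comparing the two functions should show nothing between the stores in the first one and a fence-type instruction in the second, though the exact instruction depends on the target and GCC version.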

Here's another good explanation:

Types of Memory Barriers

As mentioned above, both compilers and processors can optimize the execution of instructions in a way that necessitates the use of a memory barrier. A memory barrier that affects both the compiler and the processor is a hardware memory barrier, and a memory barrier that only affects the compiler is a software memory barrier.

In addition to hardware and software memory barriers, a memory barrier can be restricted to memory reads, memory writes, or both. A memory barrier that affects both reads and writes is a full memory barrier.

There is also a class of memory barriers that is specific to multi-processor environments. The names of these memory barriers are prefixed with "smp". On a multi-processor system, these barriers are hardware memory barriers, and on uni-processor systems, they are software memory barriers.

The barrier() macro is the only software memory barrier, and it is a full memory barrier. All other memory barriers in the Linux kernel are hardware barriers. A hardware memory barrier is an implied software barrier.
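The split described above can be sketched in user space roughly as follows. This is not kernel code: MY_CONFIG_SMP and my_smp_mb() are hypothetical names, and __sync_synchronize() merely stands in for whatever hardware barrier a real kernel would use on the target architecture.

```c
/* Compiler-only full barrier, written the way GCC inline asm allows. */
#define barrier() __asm__ __volatile__("" ::: "memory")

/* MY_CONFIG_SMP is a hypothetical build switch, not a real kernel option. */
#ifdef MY_CONFIG_SMP
#define my_smp_mb() __sync_synchronize() /* HW barrier on multi-processor builds */
#else
#define my_smp_mb() barrier()            /* SW barrier on uni-processor builds */
#endif

static int shared_a;
static int shared_b;

/* Stores to shared_a and shared_b are kept in program order by my_smp_mb(). */
int store_both(int a, int b)
{
    shared_a = a;
    my_smp_mb();
    shared_b = b;
    return shared_a + shared_b;
}
```

Building with -DMY_CONFIG_SMP selects the hardware variant; without it, only the compiler is constrained.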

An example for when SW barrier is useful: consider the following code -

for (i = 0; i < N; ++i) {
    a[i]++;
}

This simple loop, compiled with optimizations, would most likely be unrolled and vectorized. Here is the assembly that gcc 4.8.0 generated at -O3, using packed (vector) operations:

400420:       66 0f 6f 00             movdqa (%rax),%xmm0
400424:       48 83 c0 10             add    $0x10,%rax
400428:       66 0f fe c1             paddd  %xmm1,%xmm0
40042c:       66 0f 7f 40 f0          movdqa %xmm0,0xfffffffffffffff0(%rax)
400431:       48 39 d0                cmp    %rdx,%rax
400434:       75 ea                   jne    400420 <main+0x30>

However, when adding your inline assembly on each iteration, gcc is not permitted to change the order of the operations past the barrier, so it can't group them, and the assembly becomes the scalar version of the loop:

400418:       83 00 01                addl   $0x1,(%rax)
40041b:       48 83 c0 04             add    $0x4,%rax
40041f:       48 39 d0                cmp    %rdx,%rax
400422:       75 f4                   jne    400418 <main+0x28>
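For reference, the two loop variants compared above might look like this as complete functions. The function names and the compiler_barrier macro are assumptions added for illustration; the loop body and the asm statement are the ones from the question.

```c
#include <stddef.h>

#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

/* Plain loop: the compiler is free to unroll and vectorize it. */
void increment_all(int *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i]++;
    }
}

/* Same loop with a SW barrier in every iteration: each store must be
 * emitted before the next load, so the increments cannot be grouped. */
void increment_all_with_barrier(int *a, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        a[i]++;
        compiler_barrier();
    }
}
```

Both functions compute the same result; only the generated code differs, which can be checked by compiling with gcc -O3 -S.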

However, when the CPU executes this code, it is still permitted to reorder the operations "under the hood", as long as it does not break its memory ordering model. The operations can therefore be performed out of order (if the CPU supports that, as most do these days). A HW fence would have prevented that.
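A typical place where the HW fence matters is handing a value from one thread to another through a flag. The following is only a sketch under assumptions: payload, ready_flag and both function names are hypothetical, and the two functions are meant to run on different threads or cores.

```c
static int payload;
static volatile int ready_flag;

/* Producer side: the payload store must become visible before the flag. */
void producer_publish(int value)
{
    payload = value;
    __sync_synchronize();
    ready_flag = 1;
}

/* Consumer side: returns 1 and fills *out once a value has been published,
 * otherwise returns 0. The flag load must complete before the payload load. */
int consumer_try_read(int *out)
{
    if (!ready_flag) {
        return 0;
    }
    __sync_synchronize();
    *out = payload;
    return 1;
}
```

With only the SW barrier in place of __sync_synchronize(), the compiler would keep the order, but a weakly ordered CPU could still make the flag visible before the payload.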

A comment on the usefulness of SW-only barriers:

On some micro-controllers, and other embedded platforms, you may have multitasking, but no cache system or cache latency, and hence no HW barrier instructions. So you need to do things like SW spin-locks. The SW barrier prevents compiler optimizations (read/write combining and reordering) in these algorithms.
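One way such a SW spin-lock could look is Peterson's algorithm for two tasks. This is a sketch that assumes a single in-order core with no reordering in hardware, a preemptive scheduler, and exactly two tasks numbered 0 and 1; all names here are hypothetical.

```c
#define compiler_barrier() __asm__ __volatile__("" ::: "memory")

static volatile int wants[2]; /* wants[i] != 0: task i wants the lock */
static volatile int turn;     /* which task must wait if both want it */
static int protected_counter; /* data guarded by the lock */

void task_lock(int self)
{
    int other = 1 - self;
    wants[self] = 1;
    turn = other;
    compiler_barrier(); /* keep the stores above ahead of the spin loop */
    while (wants[other] && turn == other) {
        /* spin until the other task leaves or gives up its turn */
    }
    compiler_barrier(); /* keep the critical section after the loop */
}

void task_unlock(int self)
{
    compiler_barrier(); /* keep the critical section before the release */
    wants[self] = 0;
}

/* Example critical section: increments the counter under the lock. */
int locked_increment(int self)
{
    task_lock(self);
    int value = ++protected_counter;
    task_unlock(self);
    return value;
}
```

On a multi-core or out-of-order system the compiler barriers would not be enough, and HW barriers would be needed in the same places.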


Tags:

c

gcc
