gcc and cpu_relax, smb_mb, etc.?

Question

I've been reading on compiler optimizations vs CPU optimizations, and volatile vs memory barriers.

One thing which isn't clear to me is that my current understanding is that CPU optimizations and compiler optimizations are orthogonal. I.e. can occur independently of each other.

However, the article volatile considered harmful makes the point that volatile should not be used. Linus's post makes similar claims. The main reasoning, IIUC, is that marking a variable as volatile disables all compiler optimizations when accessing that variable (i.e. even if they are not harmful), while still not providing protection against memory reorderings. Essentially, the main point is that it's not the data that should be handled with care, but rather a particular access pattern needs to be handled with care.

Now, the volatile considered harmful article gives the following example of a busy loop waiting for a flag:

while (my_variable != what_i_want) {}

and makes the point that the compiler can optimize the access to my_variable so that it only occurs once and not in a loop. The solution, so the article claims, is the following:

while (my_variable != what_i_want)
    cpu_relax();

It is said that cpu_relax acts as a compiler barrier (earlier versions of the article said that it's a memory barrier).

I have several gaps here:

1) Is the implication that gcc has special knowledge of the cpu_relax call, and that it translates to a hint to both the compiler and the CPU?

2) Is the same true for other instructions such as smb_mb() and the likes?

3) How does that work, given that cpu_relax is essentially defined as a C macro? If I manually expand cpu_relax will gcc still respect it as a compiler barrier? How can I know which calls are respected by gcc?

4) What is the scope of cpu_relax as far as gcc is concerned? In other words, what's the scope of reads that cannot be optimized by gcc when it sees the cpu_relax instruction? From the CPU's perspective, the scope is wide (memory barriers place a mark in the read or write buffer). I would guess gcc uses a smaller scope - perhaps the C scope?

Arch D. Robison · Accepted Answer

Yes, gcc has special knowledge of the semantics of cpu_relax or whatever it expands to, and must translate it to something for which the hardware will respect the semantics too.
Yes, any kind of memory fencing primitive needs special respect by the compiler and hardware.
Look at what the macro expands to, e.g. compile with "gcc -E" and examine the output. You'll have to read the compiler documentation to find out the semantics of the primitives.
The scope of a memory fence is as wide as the scope the compiler might move a load or store across. A non-optimizing compiler that never moves loads or stores across a subroutine call might not need to pay much attention to a memory fence that is represented as a subroutine call. An optimizing compiler that does interprocedural optimization across translation units would need to track a memory fence across a much bigger scope.

dturvene · Answer

There are a number subtle questions related to cpu and smp concurrency in your questions which will require you to look at the kernel code. Here are some quick ideas to get you started on the research specifically for the x86 architecture.

The idea is that you are trying to perform a concurrency operation where your kernel task (see kernel source sched.h for struct task_struct) is in a tight loop comparing my_variable with a local variable until it is changed by another kernel task (or change asynchronously by a hardware device!) This is a common pattern in the kernel.

The kernel has been ported to a number of architectures and each has a specific set of machine instructions to handle concurrency. For x86, cpu_relax maps to the PAUSE machine instruction. It allows an x86 CPU to more efficiently run a spinlock so that the lock variable update is more readily visible by the spinning CPU. GCC will execute the function/macro just like any other function. If cpu_relax is removed from the loop then gcc CAN consider the loop as non-functional and remove it. Look at the Intel X86 Software Manuals for the PAUSE instruction.
smp_mb is an x86 memory fence instruction that flushes the memory cache. One CPU can change my_variable in its cache but it will not be visible to other CPUs. smp_mb provides on-demand cache coherency. Look at the Intel X86 Software Manuals for MFENCE/LFENCE instructions.

Note that smp_mb() flushes the CPU cache so it CAN be an expensive operation. Current Intel CPUs have huge caches (~6MB).

If you expand cpu_relax on an x86, it will show asm volatile("rep; nop" ::: "memory"). This is NOT a compiler barrier but code that GCC will not optimize out. See the barrier macro, which is asm volatile("": : : "memory") for the GCC hint.
I'm not clear what you mean by "scope of cpu_relax". Some possible ideas: It's the PAUSE machine instruction, similar to ADD or MOV. PAUSE will affect only the current CPU. PAUSE allows for more efficient cache coherency between CPUs.

I just looked at the PAUSE instruction a little more - an additional property is it prevents the CPU from doing out-of-order memory speculation when leaving a tight loop/spinlock. I'm not clear what THAT means but I suppose it could briefly indicate a false value in a variable? Still a lot of questions....

gcc and cpu_relax, smb_mb, etc.?

Tags:

optimization

gcc

memory-barriers

barrier

YSK

2 Answers

Arch D. Robison

dturvene

Recent Activity

Donate For Us

gcc and cpu_relax, smb_mb, etc.?

Tags:

optimization

gcc

memory-barriers

barrier

YSK

2 Answers

Arch D. Robison

dturvene

Related questions

Recent Activity

Donate For Us