Why do GCC and Clang generate such different asm for this code (x86_64, -O3 -std=c++17)?
#include <atomic>

int global_var = 0;

int foo_seq_cst(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_seq_cst);
    return ia.load(std::memory_order_seq_cst);
}

int foo_relaxed(int a)
{
    std::atomic<int> ia;
    ia.store(global_var + a, std::memory_order_relaxed);
    return ia.load(std::memory_order_relaxed);
}
GCC 9.1:
foo_seq_cst(int):
    add edi, DWORD PTR global_var[rip]
    mov DWORD PTR [rsp-4], edi
    mfence
    mov eax, DWORD PTR [rsp-4]
    ret
foo_relaxed(int):
    add edi, DWORD PTR global_var[rip]
    mov DWORD PTR [rsp-4], edi
    mov eax, DWORD PTR [rsp-4]
    ret
Clang 8.0:
foo_seq_cst(int): # @foo_seq_cst(int)
    mov eax, edi
    add eax, dword ptr [rip + global_var]
    ret
foo_relaxed(int): # @foo_relaxed(int)
    mov eax, edi
    add eax, dword ptr [rip + global_var]
    ret
I suspect that mfence here is overkill; am I right? Or does Clang generate code that can lead to bugs in some cases?
A more realistic example:
#include <atomic>

std::atomic<int> a;

void foo_seq_cst(int b) {
    a = b;
}

void foo_relaxed(int b) {
    a.store(b, std::memory_order_relaxed);
}
gcc-9.1:
foo_seq_cst(int):
    mov DWORD PTR a[rip], edi
    mfence
    ret
foo_relaxed(int):
    mov DWORD PTR a[rip], edi
    ret
clang-8.0:
foo_seq_cst(int): # @foo_seq_cst(int)
    xchg dword ptr [rip + a], edi
    ret
foo_relaxed(int): # @foo_relaxed(int)
    mov dword ptr [rip + a], edi
    ret
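For the std::memory_order_seq_cst store, gcc uses mov followed by mfence, whereas clang uses xchg. An xchg with a memory operand is implicitly locked, as if it carried the lock prefix. Both a locked instruction and mfence satisfy the requirements of std::memory_order_seq_cst on x86: the store is not reordered with later loads, and all threads observe a single total order.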
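To see why a seq_cst store needs that extra barrier at all, the classic store-buffer litmus test is a reasonable sketch. The names x, y and count_both_zero are made up for illustration and do not appear in the question. Each thread stores to one variable and then loads the other; with seq_cst on every operation, the outcome where both loads return 0 is forbidden, which is the guarantee that mfence or xchg pays for on x86.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical names, chosen only for this sketch.
std::atomic<int> x{0}, y{0};

// Runs the store-buffer litmus test `iterations` times and returns how many
// runs ended with both threads reading 0. With seq_cst this must be 0.
int count_both_zero(int iterations)
{
    assert(iterations >= 0);
    int both_zero = 0;
    for (int i = 0; i < iterations; ++i) {
        x.store(0, std::memory_order_seq_cst);
        y.store(0, std::memory_order_seq_cst);
        int r1 = -1, r2 = -1;
        std::thread t1([&] {
            x.store(1, std::memory_order_seq_cst);   // store to x ...
            r1 = y.load(std::memory_order_seq_cst);  // ... then load y
        });
        std::thread t2([&] {
            y.store(1, std::memory_order_seq_cst);   // store to y ...
            r2 = x.load(std::memory_order_seq_cst);  // ... then load x
        });
        t1.join();  // join() makes r1 and r2 safe to read here
        t2.join();
        if (r1 == 0 && r2 == 0)
            ++both_zero;
    }
    return both_zero;
}
```

Calling count_both_zero(1000) should return 0 with either compiler. If the four atomic operations are changed to std::memory_order_relaxed, the stores compile to plain mov and a result above 0 becomes permitted, although it may take many runs to observe.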
From the Intel 64 and IA-32 Architectures Software Developer’s Manual:
MFENCE—Memory Fence
Performs a serializing operation on all load-from-memory and store-to-memory instructions that were issued prior the MFENCE instruction. This serializing operation guarantees that every load and store instruction that precedes the MFENCE instruction in program order becomes globally visible before any load or store instruction that follows the MFENCE instruction. The MFENCE instruction is ordered with respect to all load and store instructions, other MFENCE instructions, any LFENCE and SFENCE instructions, and any serializing instructions (such as the CPUID instruction). MFENCE does not serialize the instruction stream.
8.2.3.8 Locked Instructions Have a Total Order
The memory-ordering model ensures that all processors agree on a single execution order of all locked instructions, including those that are larger than 8 bytes or are not naturally aligned.
8.2.3.9 Loads and Stores Are Not Reordered with Locked Instructions
The memory-ordering model prevents loads and stores from being reordered with locked instructions that execute earlier or later.
lock was benchmarked to be 2-3x faster than mfence, and Linux switched from mfence to lock where possible.