How is atomic_flag is implemented? It feels to me that on x86-64 it is equivalent to atomic_boolanyway, but it is just a guess. Might the x86-64 implementation be any different from arm or x86?
Yeah, on normal CPUs where atomic<bool> and atomic<int> are also lock-free, it's pretty much like atomic<bool>, using the same instructions. (x86 and x86-64 have the same set of atomic operations available.)
You might think that it would always use x86 lock bts or lock btr to set / reset (clear) a single bit, but it can be more efficient to do other things (especially for a function that returns a bool instead of branching on it). The object is a whole byte so you can just store or exchange the whole byte. (And if the ABI guarantees that the value is always 0 or 1, you don't have to booleanize it before returning the result as a bool)
GCC and clang compile test_and_set to a byte exchange, and clear to a byte store of 0. We get (nearly) identical asm for atomic_flag test_and_set as f.exchange(true);
#include <atomic>
bool TAS(std::atomic_flag &f) {
return f.test_and_set();
}
bool TAS_bool(std::atomic<bool> &f) {
return f.exchange(true);
}
void clear(std::atomic_flag &f) {
//f = 0; // deleted
f.clear();
}
void clear_relaxed(std::atomic_flag &f) {
f.clear(std::memory_order_relaxed);
}
void bool_clear(std::atomic<bool> &f) {
f = false; // deleted
}
On Godbolt for x86-64 with gcc and clang, and for ARMv7 and AArch64.
## GCC9.2 -O3 for x86-64
TAS(std::atomic_flag&):
mov eax, 1
xchg al, BYTE PTR [rdi]
ret
TAS_bool(std::atomic<bool>&):
mov eax, 1
xchg al, BYTE PTR [rdi]
test al, al
setne al # missed optimization, doesn't need to booleanize to 0/1
ret
clear(std::atomic_flag&):
mov BYTE PTR [rdi], 0
mfence # memory fence to drain store buffer before future loads
ret
clear_relaxed(std::atomic_flag&):
mov BYTE PTR [rdi], 0 # x86 stores are already mo_release, no barrier
ret
bool_clear(std::atomic<bool>&):
mov BYTE PTR [rdi], 0
mfence
ret
Note that xchg is also an efficient way to do a seq_cst store on x86-64, usually more efficient than the mov + mfence that gcc uses. Clang uses xchg for all of these (except the relaxed store).
Amusingly, clang re-booleanizes to 0/1 after the xchg in atomic_flag.test_and_set(), but GCC instead does it after atomic<bool>. clang does a weird and al,1 in TAS_bool, which would treat values like 2 as false. It seems totally pointless; the ABI guarantees that a bool in memory is always stored as a 0 or 1 byte.
For ARM, we have ldrexb / strexb exchange retry loops, or just strb + dmb ish for the pure store. Or AArch64 can use stlrb wzr, [x0] for clear or assign-false to do a sequential-release store (of the zero-register) without needing a barrier.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With