How is atomic_flag implemented?

Question

How is atomic_flag is implemented? It feels to me that on x86-64 it is equivalent to atomic_boolanyway, but it is just a guess. Might the x86-64 implementation be any different from arm or x86?

Peter Cordes · Accepted Answer

Yeah, on normal CPUs where atomic<bool> and atomic<int> are also lock-free, it's pretty much like atomic<bool>, using the same instructions. (x86 and x86-64 have the same set of atomic operations available.)

You might think that it would always use x86 lock bts or lock btr to set / reset (clear) a single bit, but it can be more efficient to do other things (especially for a function that returns a bool instead of branching on it). The object is a whole byte so you can just store or exchange the whole byte. (And if the ABI guarantees that the value is always 0 or 1, you don't have to booleanize it before returning the result as a bool)

GCC and clang compile test_and_set to a byte exchange, and clear to a byte store of 0. We get (nearly) identical asm for atomic_flag test_and_set as f.exchange(true);

#include <atomic>

bool TAS(std::atomic_flag &f) {
    return f.test_and_set();
}

bool TAS_bool(std::atomic<bool> &f) {
    return f.exchange(true);
}


void clear(std::atomic_flag &f) {
    //f = 0; // deleted
    f.clear();
}

void clear_relaxed(std::atomic_flag &f) {
    f.clear(std::memory_order_relaxed);
}

void bool_clear(std::atomic<bool> &f) {
    f = false; // deleted
}

On Godbolt for x86-64 with gcc and clang, and for ARMv7 and AArch64.

## GCC9.2 -O3 for x86-64
TAS(std::atomic_flag&):
        mov     eax, 1
        xchg    al, BYTE PTR [rdi]
        ret
TAS_bool(std::atomic<bool>&):
        mov     eax, 1
        xchg    al, BYTE PTR [rdi]
        test    al, al
        setne   al                      # missed optimization, doesn't need to booleanize to 0/1
        ret
clear(std::atomic_flag&):
        mov     BYTE PTR [rdi], 0
        mfence                          # memory fence to drain store buffer before future loads
        ret
clear_relaxed(std::atomic_flag&):
        mov     BYTE PTR [rdi], 0      # x86 stores are already mo_release, no barrier
        ret
bool_clear(std::atomic<bool>&):
        mov     BYTE PTR [rdi], 0
        mfence
        ret

Note that xchg is also an efficient way to do a seq_cst store on x86-64, usually more efficient than the mov + mfence that gcc uses. Clang uses xchg for all of these (except the relaxed store).

Amusingly, clang re-booleanizes to 0/1 after the xchg in atomic_flag.test_and_set(), but GCC instead does it after atomic<bool>. clang does a weird and al,1 in TAS_bool, which would treat values like 2 as false. It seems totally pointless; the ABI guarantees that a bool in memory is always stored as a 0 or 1 byte.

For ARM, we have ldrexb / strexb exchange retry loops, or just strb + dmb ish for the pure store. Or AArch64 can use stlrb wzr, [x0] for clear or assign-false to do a sequential-release store (of the zero-register) without needing a barrier.

How is atomic_flag implemented?

Tags:

c++

atomic

x86-64

arm

stdatomic

Yuki

1 Answers

Peter Cordes

Recent Activity

Donate For Us

How is atomic_flag implemented?

Tags:

c++

atomic

x86-64

arm

stdatomic

Yuki

1 Answers

Peter Cordes

Related questions

Recent Activity

Donate For Us