
C++ atomics and cross-thread visibility

AFAIK, the C++ atomics (<atomic>) family provides three benefits:

  • primitive instruction indivisibility (no torn reads or writes),
  • memory ordering (for both the CPU and the compiler), and
  • cross-thread visibility/propagation of changes.

I am not sure about the third bullet, so take a look at the following example.

#include <atomic>

std::atomic_bool a_flag = ATOMIC_VAR_INIT(false);
struct Data {
    int x;
    long long y;
    char const* z;
} data;

void thread0()
{
    // due to "release" the data will be written to memory
    // exactly in the following order: x -> y -> z
    data.x = 1;
    data.y = 100;
    data.z = "foo";
    // there can be an arbitrary delay between the write
    // to any of the members and its visibility in other
    // threads (which don't synchronize explicitly)

    // atomic_bool guarantees that the write to "a_flag"
    // will be clean, thus no other thread will ever read some
    // strange mixture of 4 bits + 4 bits
    a_flag.store(true, std::memory_order_release);
}

void thread1()
{
    while (a_flag.load(std::memory_order_acquire) == false) {};
    // "acquire" on a "released" atomic guarantees that all the writes from 
    // thread0 (thus data members modification) will be visible here
}

void thread2()
{
    while (data.y != 100) {};
    // not "acquiring" the "a_flag" doesn't guarantee that we will see all the
    // memory writes, but when I see y == 100 I know I can assume that the
    // prior writes have been done due to "release ordering" => assert(x == 1)
}

int main()
{
    thread0(); // concurrently
    thread1(); // concurrently
    thread2(); // concurrently

    // join

    return 0;
}

First, please validate my assumptions in code (especially thread2).

Second, my questions are:

  1. How does the a_flag write propagate to other cores?

  2. Does the std::atomic synchronize the a_flag in the writer cache with the other cores cache (using MESI, or anything else), or the propagation is automatic?

  3. Assuming that on a particular machine a write to a flag is atomic (think int_32 on x86) AND we don't have any private memory to synchronize (we only have a flag) do we need to use atomics?

  4. Taking into consideration most popular CPU architectures (x86, x64, ARM v.whatever, IA-64), is the cross-core visibility (I am now not considering reorderings) automatic (but potentially delayed), or you need to issue specific commands to propagate any piece of data?

Red XIII asked Oct 17 '13


1 Answer

  1. Cores themselves don't matter. The question is "how do all cores see the same memory update eventually", which is something your hardware does for you (e.g. cache coherency protocols). There is only one memory, so the main concern is caching, which is a private concern of the hardware.

  2. That question seems unclear. What matters is the acquire-release pair formed by the load and store of a_flag, which is a synchronisation point and causes the effects of thread0 and thread1 to appear in a certain order (i.e. everything in thread0 before the store happens-before everything after the loop in thread1).

  3. Yes, otherwise you wouldn't have a synchronisation point.

  4. You don't need any "commands" in C++. C++ isn't even aware of the fact that it's running on any particular kind of CPU. You could probably run a C++ program on a Rubik's cube with enough imagination. A C++ compiler chooses the necessary instructions to implement the synchronisation behaviour that's described by the C++ memory model; on x86 that involves emitting lock-prefixed instructions and memory fences, as well as not reordering instructions too much. Since x86 has a strongly ordered memory model, the above code should produce minimal additional code compared to the naive, incorrect one without atomics.

  5. Having your thread2 in the code makes the entire program undefined behaviour.


Just for fun, and to show that working out what's happening for yourself can be edifying, I compiled the code in three variations. (I added a global int x, and in thread1 I added x = data.y; after the loop.)

Acquire/Release: (your code)

thread0:
    mov DWORD PTR data, 1
    mov DWORD PTR data+4, 100
    mov DWORD PTR data+8, 0
    mov DWORD PTR data+12, OFFSET FLAT:.LC0
    mov BYTE PTR a_flag, 1
    ret

thread1:
.L14:
    movzx   eax, BYTE PTR a_flag
    test    al, al
    je  .L14
    mov eax, DWORD PTR data+4
    mov DWORD PTR x, eax
    ret

Sequentially consistent: (remove the explicit ordering)

thread0:
    mov eax, 1
    mov DWORD PTR data, 1
    mov DWORD PTR data+4, 100
    mov DWORD PTR data+8, 0
    mov DWORD PTR data+12, OFFSET FLAT:.LC0
    xchg    al, BYTE PTR a_flag
    ret

thread1:
.L14:
    movzx   eax, BYTE PTR a_flag
    test    al, al
    je  .L14
    mov eax, DWORD PTR data+4
    mov DWORD PTR x, eax
    ret

"Naive": (just using bool)

thread0:
    mov DWORD PTR data, 1
    mov DWORD PTR data+4, 100
    mov DWORD PTR data+8, 0
    mov DWORD PTR data+12, OFFSET FLAT:.LC0
    mov BYTE PTR a_flag, 1
    ret

thread1:
    cmp BYTE PTR a_flag, 0
    jne .L3
.L4:
    jmp .L4
.L3:
    mov eax, DWORD PTR data+4
    mov DWORD PTR x, eax
    ret

As you can see, there's not a big difference. The "incorrect" version actually looks mostly correct, except that it misses reloading the flag inside the loop (the cmp with a memory operand is executed only once, leaving the infinite jmp .L4 loop). The sequentially consistent version hides its expense in the xchg instruction, which has an implicit lock prefix and doesn't seem to require any explicit fences.

Kerrek SB answered Oct 12 '22