AFAIK the C++ atomics (<atomic>) family provides 3 benefits:
- primitive instruction indivisibility (no torn reads or writes),
- memory ordering (for both the compiler and the CPU), and
- cross-thread visibility/propagation of changes.

And I am not sure about the third bullet, so take a look at the following example.
#include <atomic>
#include <thread>

std::atomic_bool a_flag{false};

struct Data {
    int x;
    long long y;
    char const* z;
} data;
void thread0()
{
    // due to "release" the data will be written to memory
    // exactly in the following order: x -> y -> z
    data.x = 1;
    data.y = 100;
    data.z = "foo";

    // there can be an arbitrary delay between the write
    // to any of the members and its visibility in other
    // threads (which don't synchronize explicitly)

    // atomic_bool guarantees that the write to "a_flag"
    // will be clean, thus no other thread will ever read
    // some strange mixture of 4 bits + 4 bits (a torn write)
    a_flag.store(true, std::memory_order_release);
}
void thread1()
{
    while (a_flag.load(std::memory_order_acquire) == false) {}

    // "acquire" on a "released" atomic guarantees that all the writes from
    // thread0 (thus the data member modifications) will be visible here
}
void thread2()
{
    while (data.y != 100) {}

    // not "acquiring" the "a_flag" doesn't guarantee that we will see all the
    // memory writes, but when I see y == 100 I know I can assume that the
    // prior writes have been done due to "release ordering" => assert(x == 1)
}
int main()
{
    std::thread t0(thread0); // run concurrently
    std::thread t1(thread1); // run concurrently
    std::thread t2(thread2); // run concurrently

    // join
    t0.join();
    t1.join();
    t2.join();
    return 0;
}
First, please validate my assumptions in the code (especially thread2).
Second, my questions are:
How does the a_flag write propagate to the other cores?
Does std::atomic synchronize the a_flag in the writer's cache with the other cores' caches (using MESI, or anything else), or is the propagation automatic?
Assuming that on a particular machine a write to the flag is atomic (think a 32-bit int on x86) AND we don't have any private memory to synchronize (we only have the flag), do we need to use atomics?
Taking into consideration the most popular CPU architectures (x86, x64, ARM v.whatever, IA-64), is cross-core visibility (I am not considering reorderings here) automatic (but potentially delayed), or do you need to issue specific commands to propagate any piece of data?
Cores themselves don't matter. The question is "how do all cores see the same memory update eventually", which is something your hardware does for you (e.g. cache coherency protocols). There is only one memory, so the main concern is caching, which is a private concern of the hardware.
That question seems unclear. What matters is the acquire-release pair formed by the load and store of a_flag, which is a synchronisation point and causes the effects of thread0 and thread1 to appear in a certain order (i.e. everything in thread0 before the store happens-before everything after the loop in thread1).
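For instance, the same pairing can also be spelled with explicit fences around relaxed accesses; a minimal sketch (the names are made up, not taken from the question):

#include <atomic>

std::atomic<bool> flag{false};
int payload = 0;

void producer()
{
    payload = 42;                                        // plain write
    std::atomic_thread_fence(std::memory_order_release); // release fence ...
    flag.store(true, std::memory_order_relaxed);         // ... before the relaxed store
}

void consumer()
{
    while (!flag.load(std::memory_order_relaxed)) {}     // relaxed load ...
    std::atomic_thread_fence(std::memory_order_acquire); // ... then an acquire fence
    // the fence pair synchronises just like the acquire/release pair in the question,
    // so payload == 42 is guaranteed to be visible here
}

Either spelling establishes the same happens-before edge.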
Yes, otherwise you wouldn't have a synchronisation point.
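Even in the flag-only case the accesses still have to go through std::atomic at the C++ level; a plain bool would be a data race, and the compiler may then read the flag only once (the "naive" listing further down shows exactly that). A minimal sketch with made-up names:

#include <atomic>

std::atomic<bool> done{false};   // the flag is the only shared state

void signaller()
{
    done.store(true, std::memory_order_release);
}

void waiter()
{
    // the acquire load is the synchronisation point
    while (!done.load(std::memory_order_acquire)) {}
}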
You don't need any "commands" in C++. C++ isn't even aware of the fact that it's running on any particular kind of CPU. You could probably run a C++ program on a Rubik's cube with enough imagination. A C++ compiler chooses the necessary instructions to implement the synchronisation behaviour that's described by the C++ memory model, and on x86 that involves issuing lock instruction prefixes and memory fences, as well as not reordering instructions too much. Since x86 has a strongly ordered memory model, the above code should produce minimal additional code compared to the naive, incorrect one without atomics.
Having your thread2 in the code makes the entire program undefined behaviour: it reads the plain data.y while thread0 writes it, with no synchronisation between the two threads, which is a data race.
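If you do want a third reader, it has to take part in the synchronisation as well, for example by spinning on the flag itself. A sketch that reuses a_flag and data from the question:

void thread2_fixed()
{
    // the acquire load synchronises with the release store in thread0,
    // so reading the plain members afterwards is no longer a data race
    while (!a_flag.load(std::memory_order_acquire)) {}
    assert(data.x == 1 && data.y == 100); // needs <cassert>
}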
Just for fun, and to show that working out what's happening for yourself can be edifying, I compiled the code in three variations. (I added a global int x, and in thread1 I added x = data.y;.)
Acquire/Release: (your code)
thread0:
        mov     DWORD PTR data, 1
        mov     DWORD PTR data+4, 100
        mov     DWORD PTR data+8, 0
        mov     DWORD PTR data+12, OFFSET FLAT:.LC0
        mov     BYTE PTR a_flag, 1
        ret

thread1:
.L14:
        movzx   eax, BYTE PTR a_flag
        test    al, al
        je      .L14
        mov     eax, DWORD PTR data+4
        mov     DWORD PTR x, eax
        ret
Sequentially consistent: (remove the explicit ordering)
thread0:
        mov     eax, 1
        mov     DWORD PTR data, 1
        mov     DWORD PTR data+4, 100
        mov     DWORD PTR data+8, 0
        mov     DWORD PTR data+12, OFFSET FLAT:.LC0
        xchg    al, BYTE PTR a_flag     # implicit lock prefix: acts as a full barrier
        ret

thread1:
.L14:
        movzx   eax, BYTE PTR a_flag
        test    al, al
        je      .L14
        mov     eax, DWORD PTR data+4
        mov     DWORD PTR x, eax
        ret
"Naive": (just using bool
)
thread0:
        mov     DWORD PTR data, 1
        mov     DWORD PTR data+4, 100
        mov     DWORD PTR data+8, 0
        mov     DWORD PTR data+12, OFFSET FLAT:.LC0
        mov     BYTE PTR a_flag, 1
        ret

thread1:
        cmp     BYTE PTR a_flag, 0      # the flag is read only once, outside the loop
        jne     .L3
.L4:
        jmp     .L4                     # if it wasn't set yet, spin here forever
.L3:
        mov     eax, DWORD PTR data+4
        mov     DWORD PTR x, eax
        ret
As you can see, there's not a big difference. The "incorrect" version actually looks mostly correct, except that a_flag is only read once, with a cmp on a memory operand outside the loop, so if the flag wasn't set yet the loop at .L4 spins forever. The sequentially consistent version hides its expensiveness in the xchg instruction, which has an implicit lock prefix and doesn't seem to require any explicit fences.
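(The listings look like 32-bit x86 output; you should be able to reproduce the comparison with something like g++ -O2 -S -masm=intel, though the exact code depends on the compiler version and target.)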