Memory ordering behavior of std::atomic::load

Am I wrong to assume that atomic::load should also act as a memory barrier, ensuring that all previous non-atomic writes become visible to other threads?

To illustrate:

volatile bool arm1 = false;
std::atomic_bool arm2{false};
bool triggered = false;

Thread1:

arm1 = true;
//std::atomic_thread_fence(std::memory_order_seq_cst); // this would do the trick
if (arm2.load())
    triggered = true;

Thread2:

arm2.store(true);
if (arm1)
    triggered = true;

I expected that after both threads have executed, 'triggered' would be true. Please don't suggest making arm1 atomic; the point is to explore the behavior of atomic::load.

While I have to admit I don't fully understand the formal definitions of the relaxed memory orderings, I thought that sequentially consistent ordering was pretty straightforward, in that it guarantees that "a single total order exists in which all threads observe all modifications in the same order." To me this implies that std::atomic::load with the default memory order of std::memory_order_seq_cst will also act as a memory fence. This is further corroborated by the following statement under "Sequentially-consistent ordering":

Total sequential ordering requires a full memory fence CPU instruction on all multi-core systems.

Yet, my simple example below demonstrates this is not the case with MSVC 2013, gcc 4.9 (x86) and clang 3.5.1 (x86), where the atomic load simply translates to a load instruction.

#include <atomic>

std::atomic_long al;

#ifdef _WIN32
__declspec(noinline)
#else
__attribute__((noinline))
#endif
long load() {
    return al.load(std::memory_order_seq_cst);
}

int main(int argc, char* argv[]) {
    long r = load();
}

With gcc this looks like:

load():
   mov  rax, QWORD PTR al[rip]   ; <--- plain load here, no fence or xchg
   ret
main:
   call load()
   xor  eax, eax
   ret

I'll omit the MSVC and clang output, which is essentially identical. Now, with gcc for ARM, we get what I expected:

load():
    dmb     sy                         ; <---- data memory barrier here
    movw    r3, #:lower16:.LANCHOR0
    movt    r3, #:upper16:.LANCHOR0
    ldr     r0, [r3]
    dmb     sy                         ; <----- and here
    bx      lr
main:
    push    {r3, lr}
    bl      load()
    movs    r0, #0
    pop     {r3, pc}

This is not an academic question: it results in a subtle race condition in our code, which called into question my understanding of the behavior of std::atomic.

Alf asked Feb 28 '15 15:02


1 Answer

Sigh, this was too long for a comment:

Isn't the meaning of atomic "to appear to occur instantaneously to the rest of the system"?

I'd say yes and no to that one, depending on how you think of it. For writes with SEQ_CST, yes. But as far as how atomic loads are handled, check out 29.3 of the C++11 standard. Specifically, 29.3.3 is really good reading, and 29.3.4 might be specifically what you're looking for:

For an atomic operation B that reads the value of an atomic object M, if there is a memory_order_seq_cst fence X sequenced before B, then B observes either the last memory_order_seq_cst modification of M preceding X in the total order S or a later modification of M in its modification order.

Basically, SEQ_CST forces a global order just like the standard says, but reads can return an old value without violating the 'atomic' constraint.
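To make the fence rules in that section concrete, here is a minimal sketch of the question's two-thread example. It makes several assumptions that are not in the original: arm1 is a std::atomic<bool> rather than volatile (so the program has no data race), both threads issue std::atomic_thread_fence(std::memory_order_seq_cst) between their store and their load, and each thread records what it saw in its own flag instead of writing a shared 'triggered'. The names run_trial and check_trials are hypothetical. Because the two fences are ordered in the single total order S, the load that follows whichever fence comes second must observe the other thread's store.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// One run of the two-thread example. Returns true if at least one thread
// observed the other's flag, i.e. 'triggered' would have been set.
bool run_trial() {
    std::atomic<bool> arm1{false};  // assumption: atomic instead of volatile
    std::atomic<bool> arm2{false};
    bool saw_arm2 = false;  // written only by thread 1, read after join
    bool saw_arm1 = false;  // written only by thread 2, read after join

    std::thread t1([&] {
        arm1.store(true, std::memory_order_relaxed);
        // Full fence between this thread's store and its load.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        saw_arm2 = arm2.load(std::memory_order_relaxed);
    });
    std::thread t2([&] {
        arm2.store(true, std::memory_order_relaxed);
        // Matching fence in the second thread.
        std::atomic_thread_fence(std::memory_order_seq_cst);
        saw_arm1 = arm1.load(std::memory_order_relaxed);
    });
    t1.join();
    t2.join();
    return saw_arm1 || saw_arm2;
}

// Repeats the trial n times; every trial is expected to succeed.
void check_trials(int n) {
    for (int i = 0; i < n; ++i) {
        assert(run_trial());
    }
}
```

Calling check_trials(1000) from main should never fire the assertion. If either fence is removed, the guarantee no longer holds and both threads may read false.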

To accomplish 'getting the absolute latest value' you'll need to perform an operation that forces the hardware coherency protocol to lock (the lock prefix on x86_64). This is what the atomic compare-and-exchange operations do, if you look at the assembly output.
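As a rough illustration of that approach, the sketch below reads an atomic through compare_exchange_strong instead of load. The function name read_via_rmw is hypothetical. The standard requires a read-modify-write operation to read the last value in the variable's modification order, and on x86_64 compilers typically emit a lock-prefixed cmpxchg for it, though the exact instruction is up to the compiler.

```cpp
#include <atomic>

// Reads 'a' with a read-modify-write operation rather than a plain load.
long read_via_rmw(std::atomic<long>& a) {
    long expected = 0;
    // If a == 0, this stores 0 over 0 (no visible change) and expected stays 0.
    // Otherwise the exchange fails and the current value is copied into expected.
    a.compare_exchange_strong(expected, 0, std::memory_order_seq_cst);
    return expected;
}
```

The value of the atomic is left unchanged in both cases, so the function behaves like a load that also takes part in the variable's modification order. It is considerably more expensive than a plain load, so it is only worth using where that property is actually needed.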

Myles Hathcock answered Oct 13 '22 11:10