Can volatile but unfenced reads yield indefinitely stale values? (on real hardware)

While answering this question, a further question about the OP's situation came up that I was unsure about: it's mostly a processor-architecture question, but with a knock-on question about the C++11 memory model as well.

Basically, the OP's code was looping infinitely at higher optimization levels because of the following code (slightly modified for simplicity):

while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}

where __sync_val_compare_and_swap() is GCC's atomic CAS built-in. GCC (reasonably) optimized this into an infinite loop for the case where bits_ & MASK was detected to be true before entering the loop, skipping the CAS operation entirely, so I suggested the following change (which works):

while (true) {
    uint8_t ov = bits_; // bits_ is some "uint8_t" non-local variable
    if (ov & MASK) {
        __sync_synchronize();
        continue;
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}

After I answered, the OP noted that changing bits_ to volatile uint8_t seems to work as well. I suggested not going that route, since volatile should not normally be used for synchronization, and there doesn't seem to be much downside to using a fence here anyway.
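
For concreteness, the volatile variant would look roughly like this (a sketch mirroring the snippet above, not a recommendation; volatile only forces the compiler to re-emit the load on every iteration and says nothing about what the processor or its caches do with it):

while (true) {
    uint8_t ov = bits_; // bits_ is "volatile uint8_t" here, so this load cannot be hoisted out of the loop
    if (ov & MASK) {
        continue; // no fence needed to defeat the compiler; volatile forces a fresh load
    }
    if (ov == __sync_val_compare_and_swap(&bits_, ov, ov | MASK)) {
        break;
    }
}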

However, I thought about it more, and in this case the semantics are such that it doesn't really matter if the ov & MASK check is based on a stale value, as long as it's not based on an indefinitely stale one (i.e. as long as the loop eventually breaks), since the actual attempt to update bits_ is synchronized. So is volatile enough here to guarantee that this loop terminates eventually if bits_ is updated by another thread such that bits_ & MASK == false, on any existing processor? In other words, in the absence of an explicit memory fence, is it practically possible for reads not optimized out by the compiler to be effectively optimized out by the processor instead, indefinitely? (EDIT: To be clear, I'm asking about what modern hardware might actually do given that the compiler does emit the reads in a loop, so it's not technically a language question, although expressing it in terms of C++ semantics is convenient.)

That's the hardware angle, but to make it an answerable question about the C++11 memory model as well, consider the following variation of the code above:

// bits_ is "std::atomic<unsigned char>"
unsigned char ov = bits_.load(std::memory_order_relaxed);
while (true) {
    if (ov & MASK) {
        ov = bits_.load(std::memory_order_relaxed);
        continue;
    }
    // compare_exchange_weak also updates ov if the exchange fails
    if (bits_.compare_exchange_weak(ov, ov | MASK, std::memory_order_acq_rel)) {
        break;
    }
}

cppreference claims that std::memory_order_relaxed implies "no constraints on reordering of memory accesses around the atomic variable", so independent of what actual hardware will or will not do, does this imply that bits_.load(std::memory_order_relaxed) could technically never read an updated value after bits_ is updated on another thread, in a conforming implementation?

EDIT: I found this in the standard (29.4 p13):

Implementations should make atomic stores visible to atomic loads within a reasonable amount of time.

So apparently waiting "infinitely long" for an updated value is (mostly?) out of the question, but there's no hard guarantee of any specific freshness interval other than that it should be "reasonable"; still, the question about actual hardware behavior stands.

asked Mar 17 '13 by Stephen Lin


3 Answers

C++11 atomics deal with three issues:

  1. ensuring that a complete value is read or written without a thread switch; this prevents tearing.

  2. ensuring that the compiler does not re-order instructions within a thread across an atomic read or write; this ensures ordering within the thread.

  3. ensuring (for appropriate choices of memory order parameters) that data written within a thread prior to an atomic write will be seen by a thread that reads the atomic variable and sees the value that was written. This is visibility.

When you use memory_order_relaxed you don't get a guarantee of visibility from the relaxed store or load. You do get the first two guarantees.
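
To illustrate the difference between point 3 and relaxed ordering, here is a minimal message-passing sketch (the names are illustrative, assuming any conforming C++11 implementation):

#include <atomic>

std::atomic<bool> ready{false};
int payload = 0;

// Thread 1:
void producer() {
    payload = 42;                                 // ordinary write
    ready.store(true, std::memory_order_release); // publish
}

// Thread 2:
void consumer() {
    while (!ready.load(std::memory_order_acquire)) {
        // spin until the store becomes visible
    }
    // The acquire load that observed "true" synchronizes-with the release
    // store, so payload == 42 is guaranteed here. With memory_order_relaxed
    // on both sides, ready itself would still be read without tearing, but
    // reading payload here would be a data race: no visibility guarantee.
}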

Implementations "should" (i.e. are encouraged to) make memory writes visible within a reasonable amount of time, even with relaxed ordering. That's about the best that can be said; sooner or later these things should show up.

So, yes, formally, an implementation that never made relaxed writes visible to relaxed reads conforms to the language definition. In practice, this won't happen.

As to what volatile does, ask your compiler vendor. It's up to the implementation.

answered by Pete Becker


It is technically legal for a std::memory_order_relaxed load to never, ever return a new value. As for whether any implementation will actually do this, I have no clue.

Reference: http://www.developerfusion.com/article/138018/memory-ordering-for-atomic-operations-in-c0x/ "The only requirement is that accesses to a single atomic variable from the same thread can’t be reordered: once a given thread has seen a particular value of an atomic variable, a subsequent read by that thread can’t retrieve an earlier value of the variable."
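
A minimal sketch of that per-variable coherence rule (illustrative names; this shows what the quoted requirement does and does not guarantee):

#include <atomic>

std::atomic<int> v{0};

// Thread A, elsewhere: v.store(1, std::memory_order_relaxed);

void reader() {
    int first  = v.load(std::memory_order_relaxed);
    int second = v.load(std::memory_order_relaxed);
    // Per-variable coherence: if first == 1, then second == 1 as well;
    // the second load cannot go "backwards" to 0. But nothing in the
    // formal model forces first to ever observe 1 in the first place.
}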

answered by Patashu


If the processor doesn't have a cache-coherence protocol, or only a very simple one, then it can 'optimize' loads by fetching stale data from its cache. Most modern multi-core CPUs do implement a cache-coherency protocol; however, ARM before the Cortex-A9 did not. Non-CPU architectures might also lack cache coherency (although arguably they don't adhere to the C++ memory model anyway).

Another issue is that many architectures (including ARM and x86) allow reordering of memory accesses. I don't know whether processors are smart enough to notice repeated accesses to the same address, but I doubt it (it costs space and time for a rare case, one the compiler should be able to notice itself, and with small benefit, since later accesses will likely be L1 hits). But technically a processor can speculate that a branch will be taken and reorder the second access before the first one (unlikely, but if I read the Intel and ARM manuals correctly, this is allowed).

Finally, there are external devices which do not adhere to cache coherency. If the CPU communicates via memory-mapped IO/DMA, the page must be marked as non-cacheable (otherwise the L1/L2/L3/... caches would hold stale data). In such cases the processor will usually not reorder reads and writes (for details, consult your processor manual; it may offer more fine-grained control), but the compiler can, so you need to use volatile. However, as atomics are usually cache-based, you generally neither need nor can use them here.
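
As a sketch of that memory-mapped IO case (the address and register layout here are entirely hypothetical, and the non-cacheable mapping is assumed to be set up by the platform):

#include <cstdint>

// Hypothetical device status register, mapped into a non-cacheable page:
volatile std::uint32_t* const status_reg =
    reinterpret_cast<volatile std::uint32_t*>(0x40000000);

void wait_for_device_ready() {
    // volatile forces the compiler to emit a real load on every iteration;
    // the non-cacheable mapping ensures each load actually reaches the device.
    while ((*status_reg & 0x1u) == 0) {
        // busy-wait; std::atomic would be the wrong tool here, since atomics
        // assume ordinary cache-coherent memory
    }
}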

I'm afraid I cannot say whether such strong cache coherency will be available in future processors, so I would suggest strictly following the specification ("What's wrong with storing a pointer in an int? Surely no one will ever use more than 4 GiB, so a 32-bit address is big enough."). The correctness question was answered by others, so I won't repeat it here.

answered by Maciej Piechotka