I'm writing some lock-free code, and I came up with an interesting pattern, but I'm not sure if it will behave as expected under relaxed memory ordering. The simplest way to explain it is using an example: <pre class="prettyprint"><code>std::atomic<int> a, b, c; auto a_local = a.load(std::memory_order_relaxed); auto b_local = b.load(std::memory_order_relaxed); if (a_local < b_local) { auto c_local = c.fetch_add(1, std::memory_order_relaxed); } </code></pre> Note that all operations use <code>std::memory_order_relaxed</code>. Obviously, on the thread that this is executed on, the loads for <code>a</code> and <code>b</code> must be done before the <code>if</code> condition is evaluated. Similarly, the read-modify-write (RMW) operation on <code>c</code> must be done after the condition is evaluated (because it's conditional on that... condition). What I want to know is, does this code guarantee that the value of <code>c_local</code> is at least as up-to-date as the values of <code>a_local</code> and <code>b_local</code>? If so, how is this possible given the relaxed memory ordering? Is the control dependency together with the RWM operation acting as some sort of acquire fence? (Note that there's not even a corresponding release anywhere.) If the above holds true, I believe this example should also work (assuming no overflow) -- am I right? <pre class="prettyprint"><code>std::atomic<int> a(0), b(0); // Thread 1 while (true) { auto a_local = a.fetch_add(1, std::memory_order_relaxed); if (a_local >= 0) { // Always true at runtime b.fetch_add(1, std::memory_order_relaxed); } } // Thread 2 auto b_local = b.load(std::memory_order_relaxed); if (b_local < 777) { // Note that fetch_add returns the pre-incrementation value auto a_local = a.fetch_add(1, std::memory_order_relaxed); assert(b_local <= a_local); // Is this guaranteed? } </code></pre> On thread 1, there is a control dependency which I suspect guarantees that <code>a</code> is always incremented before <code>b</code> is incremented (but they each keep being incremented neck-and-neck). On thread 2, there is another control dependency which I suspect guarantees that <code>b</code> is loaded into <code>b_local</code> before <code>a</code> is incremented. I also think that the value returned from <code>fetch_add</code> will be at least as recent as any observed value in <code>b_local</code>, and the <code>assert</code> should therefore hold. But I'm not sure, since this departs significantly from the usual memory-ordering examples, and my understanding of the C++11 memory model is not perfect (I have trouble reasoning about these memory ordering effects with any degree of certainty). Any insights would be appreciated! <hr> Update: As bames53 has helpfully pointed out in the comments, given a sufficiently smart compiler, it's possible that an <code>if</code> could be optimised out entirely under the right circumstances, in which case the relaxed loads could be reordered to occur after the RMW, causing their values to be more up-to-date than the <code>fetch_add</code> return value (the <code>assert</code> could fire in my second example). However, what if instead of an <code>if</code>, an <code>atomic_signal_fence</code> (not <code>atomic_thread_fence</code>) is inserted? That certainly can't be ignored by the compiler no matter what optimizations are done, but does it ensure that the code behaves as expected? Is the CPU allowed to do any re-ordering in such a case? The second example then becomes: <pre class="prettyprint"><code>std::atomic<int> a(0), b(0); // Thread 1 while (true) { auto a_local = a.fetch_add(1, std::memory_order_relaxed); std::atomic_signal_fence(std::memory_order_acq_rel); b.fetch_add(1, std::memory_order_relaxed); } // Thread 2 auto b_local = b.load(std::memory_order_relaxed); std::atomic_signal_fence(std::memory_order_acq_rel); // Note that fetch_add returns the pre-incrementation value auto a_local = a.fetch_add(1, std::memory_order_relaxed); assert(b_local <= a_local); // Is this guaranteed? </code></pre> <hr> Another update: After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard. So, can anyone come up with a counter-example of a theoretical system that complies with the standard and also fires the assert?

Signal fences don't provide the necessary guarantees (well, not unless 'thread 2' is a signal hander that actually runs on 'thread 1'). To guarantee correct behavior we need synchronization between threads, and the fence that does that is <code>std::atomic_thread_fence</code>. <hr> Let's label the statements so we can diagram various executions (with thread fences replacing signal fences, as required): <pre class="prettyprint"><code>while (true) { auto a_local = a.fetch_add(1, std::memory_order_relaxed); // A std::atomic_thread_fence(std::memory_order_acq_rel); // B b.fetch_add(1, std::memory_order_relaxed); // C } </code></pre> <pre class="prettyprint"><code>auto b_local = b.load(std::memory_order_relaxed); // X std::atomic_thread_fence(std::memory_order_acq_rel); // Y auto a_local = a.fetch_add(1, std::memory_order_relaxed); // Z </code></pre> So first let's assume that X loads a value written by C. The following paragraph specifies that in that case the fences synchronize and a happens-before relationship is established. 29.8/2: <blockquote> A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation. </blockquote> And here's a possible execution order where the arrows are happens-before relations. <pre class="prettyprint"><code>Thread 1: A₁ → B₁ → C₁ → A₂ → B₂ → C₂ → ... ↘ Thread 2: X → Y → Z </code></pre> <blockquote> If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. — [C++11 1.10/18] </blockquote> So the load at Z must take its value from A₁ or from a subsequent modification. Therefore the assert holds because the value written at A₁ and at all later modifications is greater than or equal to the value written at C₁ (and read by X). <hr> Now let's look at the case where the fences do not synchronize. This happens when the load of <code>b</code> does not load a value written by thread 1, but instead reads the value that <code>b</code> is initialized with. There's still synchronization where the threads starts though: 30.3.1.2/5 <blockquote> Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f. </blockquote> This is specifying the behavior of <code>std::thread</code>'s constructor. So (assuming the thread creation is correctly sequenced after the initialization of <code>a</code>) the value read by Z must take its value from the initialization of <code>a</code> or from one of the subsequent modifications on thread 1, which means that the assertions still holds.

What are the C++11 memory ordering guarantees in this corner case?

Tags:

I'm writing some lock-free code, and I came up with an interesting pattern, but I'm not sure if it will behave as expected under relaxed memory ordering.

The simplest way to explain it is using an example:

std::atomic<int> a, b, c;

auto a_local = a.load(std::memory_order_relaxed);
auto b_local = b.load(std::memory_order_relaxed);
if (a_local < b_local) {
    auto c_local = c.fetch_add(1, std::memory_order_relaxed);
}

Note that all operations use std::memory_order_relaxed.

Obviously, on the thread that this is executed on, the loads for a and b must be done before the if condition is evaluated.

Similarly, the read-modify-write (RMW) operation on c must be done after the condition is evaluated (because it's conditional on that... condition).

What I want to know is, does this code guarantee that the value of c_local is at least as up-to-date as the values of a_local and b_local? If so, how is this possible given the relaxed memory ordering? Is the control dependency together with the RWM operation acting as some sort of acquire fence? (Note that there's not even a corresponding release anywhere.)

If the above holds true, I believe this example should also work (assuming no overflow) -- am I right?

std::atomic<int> a(0), b(0);

// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    if (a_local >= 0) {    // Always true at runtime
        b.fetch_add(1, std::memory_order_relaxed);
    }
}

// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
if (b_local < 777) {
    // Note that fetch_add returns the pre-incrementation value
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    assert(b_local <= a_local);    // Is this guaranteed?
}

On thread 1, there is a control dependency which I suspect guarantees that a is always incremented before b is incremented (but they each keep being incremented neck-and-neck). On thread 2, there is another control dependency which I suspect guarantees that b is loaded into b_local before a is incremented. I also think that the value returned from fetch_add will be at least as recent as any observed value in b_local, and the assert should therefore hold. But I'm not sure, since this departs significantly from the usual memory-ordering examples, and my understanding of the C++11 memory model is not perfect (I have trouble reasoning about these memory ordering effects with any degree of certainty). Any insights would be appreciated!

Update: As bames53 has helpfully pointed out in the comments, given a sufficiently smart compiler, it's possible that an if could be optimised out entirely under the right circumstances, in which case the relaxed loads could be reordered to occur after the RMW, causing their values to be more up-to-date than the fetch_add return value (the assert could fire in my second example). However, what if instead of an if, an atomic_signal_fence (not atomic_thread_fence) is inserted? That certainly can't be ignored by the compiler no matter what optimizations are done, but does it ensure that the code behaves as expected? Is the CPU allowed to do any re-ordering in such a case?

The second example then becomes:

std::atomic<int> a(0), b(0);

// Thread 1
while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed);
    std::atomic_signal_fence(std::memory_order_acq_rel);
    b.fetch_add(1, std::memory_order_relaxed);
}

// Thread 2
auto b_local = b.load(std::memory_order_relaxed);
std::atomic_signal_fence(std::memory_order_acq_rel);
// Note that fetch_add returns the pre-incrementation value
auto a_local = a.fetch_add(1, std::memory_order_relaxed);
assert(b_local <= a_local);    // Is this guaranteed?

Another update: After reading all the responses so far and combing through the standard myself, I don't think it can be shown that the code is correct using only the standard. So, can anyone come up with a counter-example of a theoretical system that complies with the standard and also fires the assert?

595

asked Aug 14 '13 04:08

Cameron

2 Answers

Signal fences don't provide the necessary guarantees (well, not unless 'thread 2' is a signal hander that actually runs on 'thread 1').

To guarantee correct behavior we need synchronization between threads, and the fence that does that is std::atomic_thread_fence.

Let's label the statements so we can diagram various executions (with thread fences replacing signal fences, as required):

while (true) {
    auto a_local = a.fetch_add(1, std::memory_order_relaxed); // A
    std::atomic_thread_fence(std::memory_order_acq_rel);      // B
    b.fetch_add(1, std::memory_order_relaxed);                // C
}

auto b_local = b.load(std::memory_order_relaxed);             // X
std::atomic_thread_fence(std::memory_order_acq_rel);          // Y
auto a_local = a.fetch_add(1, std::memory_order_relaxed);     // Z

So first let's assume that X loads a value written by C. The following paragraph specifies that in that case the fences synchronize and a happens-before relationship is established.

29.8/2:

A release fence A synchronizes with an acquire fence B if there exist atomic operations X and Y, both operating on some atomic object M, such that A is sequenced before X, X modifies M, Y is sequenced before B, and Y reads the value written by X or a value written by any side effect in the hypothetical release sequence X would head if it were a release operation.

And here's a possible execution order where the arrows are happens-before relations.

Thread 1: A₁ → B₁ → C₁ → A₂ → B₂ → C₂ → ...
                ↘
Thread 2:    X → Y → Z

If a side effect X on an atomic object M happens before a value computation B of M, then the evaluation B shall take its value from X or from a side effect Y that follows X in the modification order of M. — [C++11 1.10/18]

So the load at Z must take its value from A₁ or from a subsequent modification. Therefore the assert holds because the value written at A₁ and at all later modifications is greater than or equal to the value written at C₁ (and read by X).

Now let's look at the case where the fences do not synchronize. This happens when the load of b does not load a value written by thread 1, but instead reads the value that b is initialized with. There's still synchronization where the threads starts though:

30.3.1.2/5

Synchronization: The completion of the invocation of the constructor synchronizes with the beginning of the invocation of the copy of f.

This is specifying the behavior of std::thread's constructor. So (assuming the thread creation is correctly sequenced after the initialization of a) the value read by Z must take its value from the initialization of a or from one of the subsequent modifications on thread 1, which means that the assertions still holds.

171

answered May 25 '23 01:05

bames53

This example gets at a variation of reads-from-thin-air like behavior. The relevant discussion in the spec is in section 29.3p9-11. Since the current version of the C11 standard doesn't guarantee dependences be respected, the memory model should allow the assertion to be fired. The most likely situation is that the compiler optimizes away the check that a_local>=0. But even if you replace that check with a signal fence, CPUs would be free to reorder those instructions. You can test such code examples under the C/C++11 memory models using the open source CDSChecker tool. The interesting issue with your example is that for an execution to violate the assertion, there has to be a cycle of dependences. More concretely:

The b.fetch_add in thread one depends on the a.fetch_add in the same loop iteration due to the if condition. The a.fetch_add in thread 2 depends on b.load. For an assertion violation, we have to have T2's b.load read from a b.fetch_add in a later loop iteration than T2's a.fetch_add. Now consider the b.fetch_add the b.load reads from and call it # for future reference. We know that b.load depends on # as it takes it value from #.

We know that # must depend on T2's a.fetch_add as T2's a.fetch_add atomic reads and updates a prior a.fetch_add from T1 in the same loop iteration as #. So we know that # depends on the a.fetch_add in thread 2. That gives us a cycle in dependences and is plain weird, but allowed by the C/C++ memory model. The most likely way of actually producing that cycle is (1) compiler figures out that a.local is always greater than 0, breaking the dependence. It can then do loop unrolling and reorder T1's fetch_add however it wants.

answered May 25 '23 00:05

briand

Related questions
                            
                                Create dynamic queues with Celery
                            
                                Merge non-merged feature branch into another feature branch with Git
                            
                                How to start with STS and Gradle
                            
                                Python memory consumption on Linux: physical and virtual memory are growing while the heap size remains the same
                            
                                How to use iOS lock screen previous/next player buttons in Safari HTML5 player with Javascript
                            
                                facebook getting user id from graph api via taggable_friends
                            
                                change a git commit already pushed [duplicate]
                            
                                Benefits of using NSInteger over int?
                            
                                crash on coredata ios8
                            
                                Is there a way to prevent an iframe from redirecting parent window, but in such a way that "top level" redirects still work inside the iframe itself?
                            
                                Associating App with Epub format
                            
                                Dynamic resizing of ggvis plots in shiny apps

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With