Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Does the C11 memory model really conflict with common optimizations?

The OP of a recent question added a comment to it linking a paper entitled Common Compiler Optimisations are Invalid in the C11 Memory Model and what we can do about it, which apparently was presented at POPL 2015. Among other things, it purports to show several unexpected and counterintuitive conclusions derived from the specifications for what it calls the "C11 memory model", which I take to consist largely of the provisions of section 5.1.2.4 of the C11 language specification.

The paper is somewhat lengthy, but for the purposes of this question I focus on the discussion of scheme "SEQ" on the second page. This concerns a multithreaded program in which ...

  • a is non-atomic (for example, an int),
  • x and y are atomic (for example, _Atomic int), and
  • a, x, and y all initially have value 0,

... and the following occurs (transliterated from pseudocode):


Thread 1

a = 1;

Thread 2

if (atomic_load_explicit(&x, memory_order_relaxed))
    if (a)
        atomic_store_explicit(&y, 1, memory_order_relaxed);

Thread 3

if (atomic_load_explicit(&y, memory_order_relaxed))
    atomic_store_explicit(&x, 1, memory_order_relaxed);

The paper makes this argument:

First, notice that there is no execution (consistent execution in the terminology of Section 2) in which the load of a occurs. We show this by contradiction. Suppose that there is an execution in which a load of a occurs. In such an execution the load of a can only return 0 (the initial value of a) because the store a=1 does not happen before it (because it is in a different thread that has not been synchronised with) and non-atomic loads must return the latest write that happens before them. Therefore, in this execution the store to y does not happen, which in turn means that the load of y cannot return 1 and the store to x also does not happen. Then, x cannot read 1, and thus the load of a does not occur. As a consequence this program is not racy: since the load of a does not occur in any execution, there are no executions with conflicting accesses on the same non-atomic variable. We conclude that the only possible final state is a=1 ∧ x=y=0.

(Question 1) But isn't that argument fatally flawed?

The assertion that the load of a can only return 0 is made subject to the assumption that a is in fact read, which the argument intends to contradict. But in that case, as the paper observes, there is no happens before relationship between the store to a in thread 1 and the load from a in thread 2. These are conflicting accesses, neither is atomic, and one is a write. Therefore, per paragraph 5.1.2.4/25, the program contains a data race resulting in undefined behavior. Because the behavior is undefined, nothing can be concluded about the value loaded from a by thread 2, and in particular, it cannot be concluded from the specification that the load must return 0. The rest of the argument then collapses.

Although the paper claims that the argument shows that the program does not contain a data race ("is not racy"), in fact that is not a consequence of the argument but rather a hidden assumption. Only if, contrary to 5.1.2.4/25, the program did not contain a data race would the argument stand up.

Now perhaps the key is that the argument above considers only "consistent executions", a term defined in a later section of the paper. I confess that it gets a little deep for me at that point, but if in fact constraining the behavior to consistent executions is sufficient to support the assertion that the load of a must return 0, then it seems that it is no longer (just) the rules of the C11 memory model that we are talking about.

This matters because the authors conclude that a source-to-source translation combining threads 1 & 2 to yield ...


Thread 2'

a = 1;
if (atomic_load_explicit(&x, memory_order_relaxed))
    if (a)
        atomic_store_explicit(&y, 1, memory_order_relaxed);

... is not permitted of implementations by the C11 memory model, on the basis that it allows executions in which the final state is a = x = y = 1. That this and various other code transformations and optimizations are invalid is the thesis of the paper.

(Question 2) But isn't it indeed valid for a C11 implementation to treat the original three-threaded program as if it were the two-threaded program consisting of threads 2' and 3?

If that allows "consistent executions" with results that could not be produced by any consistent execution of the original program, then I think that just shows that C11 does not constrain programs to exhibiting consistent executions. Am I missing something here?

like image 286
John Bollinger Avatar asked Jul 23 '21 21:07

John Bollinger


Video Answer


1 Answers

Apparently, no one is both interested enough and confident enough to write an answer, so I guess I'll go ahead.

isn't that argument fatally flawed?

To the extent that the proof quoted from the paper is intended to demonstrate that a conforming C implementation is not permitted to perform the source-to-source transformation described in the question, or an equivalent, yes, the proof is flawed. The refutation presented in the question is sound.

There was some discussion in comments about how the refutation could be viewed as boiling down to anything being permissible in the event of undefined behavior. That is a valid perspective, and in no way does it undercut the argument. However, I think it's unnecessarily minimalistic.

Again, the key problem with the paper's proof is here:

the load of a can only return 0 (the initial value of a) because the store a=1 does not happen before it (because it is in a different thread that has not been synchronised with) and non-atomic loads must return the latest write that happens before them.

The proof's error is that the language specification's requirement that a read of a must return the result of a write to a that "happened before" it is conditioned on the program being free of data races. This is an essential foundation for the whole model, not some kind of escape hatch. The program manifestly is not free of data races if in fact the read of a is performed, so the requirement is moot in that case. The read of a by thread 2 absolutely can observe the write by thread 1, and there is good reason to suppose that it might sometimes do so in practice.

To look at it another way, the proof chooses to focus on the write not happening before the read, but ignores the fact that the read also does not happen before the write.

Taking the relaxed atomic accesses into account does not change anything. It is plausible that in a real execution of the paper's three-threaded program, the implementation (for example) speculatively executes the relaxed load of x in thread 2 on the assumption that it will return 1, then reads from a the value written by thread 1, and as a result, executes the store to y. Because the atomic accesses are performed with relaxed semantics, the execution of thread 3 can read the value of y as 1 (or speculate that it will do so) and consequently perform the write to x. All speculations involved can then be confirmed correct, with the final result that a = x = y = 1. It is intentional that this seemingly paradoxical result is allowed by the "relaxed" memory order.

isn't it indeed valid for a C11 implementation to treat the original three-threaded program as if it were the two-threaded program consisting of threads 2' and 3?

At minimum, the paper's argument does not show otherwise, even if we -- with no basis in the specification -- construe the scope of the UB arising from the data race to be limited to whether the value read from a is its initial one or the one written by thread 1.

Implementations are given broad license to behave as they choose, so long as they produce observable behavior that is consistent with the behavior required of the abstract machine. The creation and execution of multiple threads of execution is not itself part of the observable behavior of a program, as that is defined by the specification. Therefore, yes, a program that performed the proposed transformation and then behaved accordingly, or one that otherwise behaved as if there were a happens before edge between the write to a and the read from a, would not be acting inconsistently with the specification.

like image 149
John Bollinger Avatar answered Oct 25 '22 02:10

John Bollinger