Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Hoisting of non-atomic loads up through acquiring atomic loads

I was under the impression that memory loads could not be hoisted above an acquiring load in the C++11 memory model. However looking at the code that gcc 4.8 produces that only seems to be true for other atomic loads, not all of memory. If that's true and acquiring loads don't synchronize all memory (just std::atomics) then I'm not sure how it would be possible to implement general purpose mutexes in terms of std::atomic.

The following code:

extern std::atomic<unsigned> seq;
extern std::atomic<int> data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces:

_Z6readerv:
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov eax, DWORD PTR data[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

Which looks correct to me.

However changing data to be an int rather than std::atomic:

extern std::atomic<unsigned> seq;
extern int data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}

Produces this:

_Z6readerv:
    mov eax, DWORD PTR data[rip]
.L3:
    mov ecx, DWORD PTR seq[rip]
    mov edx, DWORD PTR seq[rip]
    cmp ecx, edx
    jne .L3
    rep ret

So what's going on?

like image 319
jleahy Avatar asked May 23 '13 16:05

jleahy


2 Answers

Why a load was hoisted above an acquire

I've posted this on the gcc bugzilla and they've confirmed it as a bug.

the MEM alias-set of -1 (ALIAS_SET_MEMORY_BARRIER) is supposed to prevent this, but PRE does not know about this special property (it should "kill" all refs crossing it).

It looks like the gcc wiki has a nice page about this.

Generally, release is a barrier to sinking code, acquire is a barrier to hoisting code.

Why this code is still broken

As per this paper my code is still incorrect, because it introduces a data race. Even though the patched gcc generates the correct code, it's still not proper to access data without wrapping it in std::atomic. The reason is that data races are undefined behavior, even if computations resulting from them are discarded.

An example courtesy of AdamH.Peterson:

int foo(unsigned x) {
    if (x < 10) {
        /* some calculations that spill all the 
           registers so x has to be reloaded below */
        switch (x) {
        case 0:
            return 5;
        case 1:
            return 10;
        // ...
        case 9:
            return 43;
        }
    }
    return 0;
}

Here a compiler might optimise the switch into a jump table, and thanks to the if statement above would be able to avoid a range check. However if data races were not undefined behaviour then an second range check would be required.

like image 124
jleahy Avatar answered Oct 03 '22 18:10

jleahy


I do not think your atomic_thread_fence is correct. The only C++11 memory model that works with your code would be seq_cst one. But this is very expensive (you're going to get a full memory fence) for what you need.

The original code works and I think this is the best performance tradeoff.

EDIT based on your updates:

If you're looking for the formal reason why the code with a regular int is not working the way you'd like, I believe the very paper you quoted (http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) gives the answer. Look at the end of section 2. Your code has the same problem as the code in Figure 1. It has data races. Multiple threads can do operations on the same memory on the regular int at the same time. It it forbidden by the c++11 memory model, this code is formally not valid C++ code.

gcc expects the code to have no data race, i.e being valid C++ code. Since there is no race and the code loads the int unconditionally, a load can be emitted anywhere in the body of the function. So gcc is smart and it just emits it once since it is not volatile. The conditional statement that usually goes hand in hand with an acquire barrier plays an important role in what the compiler will do.

In the formal slang of the standard, the atomic loads and the regular int loads are unsequenced. The introduction for example of a condition would create a sequence point and would force the compiler to evaluate the regular int after the sequence point (http://msdn.microsoft.com/en-us/library/d45c7a5d.aspx). Then the c++ memory model would do the rest (i.e ensure the visibility by the cpu executing the instructions)

So neither of your statements are true. You can definitely build a lock with c++11, just not one with data races :-) Typically a lock would involve waiting before reading (which is obviously what you're trying to avoid here) so you do not have this kind of problems.

Note that your original seqlock is buggy because you would not want to just check seq0 != seq1 (you could be in the middle of an update). The seqlock paper has the correct condition.

like image 36
Guillaume Avatar answered Oct 03 '22 17:10

Guillaume