I was under the impression that memory loads could not be hoisted above an acquire load in the C++11 memory model. However, looking at the code that GCC 4.8 produces, that only seems to be true for other atomic loads, not for all of memory. If that's true, and acquire loads don't order all memory (just std::atomic objects), then I'm not sure how it would be possible to implement general-purpose mutexes in terms of std::atomic.
The following code:
extern std::atomic<unsigned> seq;
extern std::atomic<int> data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}
Produces:
_Z6readerv:
.L3:
mov ecx, DWORD PTR seq[rip]
mov eax, DWORD PTR data[rip]
mov edx, DWORD PTR seq[rip]
cmp ecx, edx
jne .L3
rep ret
Which looks correct to me.
However, changing data to be a plain int rather than std::atomic<int>:
extern std::atomic<unsigned> seq;
extern int data;

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data;
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
    } while (seq0 != seq1);
    return data_copy;
}
Produces this:
_Z6readerv:
mov eax, DWORD PTR data[rip]
.L3:
mov ecx, DWORD PTR seq[rip]
mov edx, DWORD PTR seq[rip]
cmp ecx, edx
jne .L3
rep ret
So what's going on?
Why a load was hoisted above an acquire
I've posted this on the gcc bugzilla and they've confirmed it as a bug:
"the MEM alias-set of -1 (ALIAS_SET_MEMORY_BARRIER) is supposed to prevent this, but PRE does not know about this special property (it should 'kill' all refs crossing it)."
It looks like the gcc wiki has a nice page about this.
Generally, release is a barrier to sinking code, acquire is a barrier to hoisting code.
Why this code is still broken
As per this paper, my code is still incorrect because it introduces a data race: even though the patched gcc generates the correct code, it's still not proper to access data without wrapping it in std::atomic. The reason is that data races are undefined behavior, even if the computations resulting from them are discarded.
An example courtesy of AdamH.Peterson:
int foo(unsigned x) {
    if (x < 10) {
        /* some calculations that spill all the
           registers so x has to be reloaded below */
        switch (x) {
        case 0:
            return 5;
        case 1:
            return 10;
        // ...
        case 9:
            return 43;
        }
    }
    return 0;
}
Here a compiler might optimise the switch into a jump table and, thanks to the if statement above, would be able to avoid a range check. However, if data races were not undefined behaviour, a second range check would be required, because the reloaded x might no longer satisfy x < 10.
I do not think your atomic_thread_fence is correct. The only C++11 memory ordering that works with your code would be seq_cst. But that is very expensive (you're going to get a full memory fence) for what you need.
The original code works and I think this is the best performance tradeoff.
EDIT based on your updates:
If you're looking for the formal reason why the code with a regular int is not working the way you'd like, I believe the very paper you quoted (http://www.hpl.hp.com/techreports/2012/HPL-2012-68.pdf) gives the answer. Look at the end of section 2. Your code has the same problem as the code in Figure 1: it has data races. Multiple threads can operate on the same memory, the regular int, at the same time. That is forbidden by the C++11 memory model; this code is formally not valid C++.
gcc expects the code to have no data race, i.e. to be valid C++ code. Since there is no race and the code loads the int unconditionally, a load can be emitted anywhere in the body of the function. So gcc is smart and just emits it once, since the variable is not volatile. The conditional statement that usually goes hand in hand with an acquire barrier plays an important role in what the compiler will do.
In the formal language of the standard, the atomic loads and the regular int load are unsequenced. Introducing a condition, for example, would create a sequence point and would force the compiler to evaluate the regular int after the sequence point (http://msdn.microsoft.com/en-us/library/d45c7a5d.aspx). Then the C++ memory model would do the rest (i.e. ensure visibility to the CPU executing the instructions).
So neither of your statements is true. You can definitely build a lock with C++11, just not one with data races :-) Typically a lock involves waiting before reading (which is obviously what you're trying to avoid here), so you do not have this kind of problem.
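For what it's worth, here is a minimal sketch of such a lock built from std::atomic_flag (the class and names are my own illustration, not the questioner's code). Plain int data touched only inside the critical section needs no std::atomic wrapper, because the acquire/release pair on the flag orders it:

```cpp
#include <atomic>

// Minimal spinlock: acquire on lock, release on unlock.
class spinlock {
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
public:
    void lock() {
        // test_and_set(acquire) prevents later loads/stores from
        // being hoisted above the lock acquisition.
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // spin until the holder clears the flag
        }
    }
    void unlock() {
        // release prevents earlier accesses from sinking below.
        flag_.clear(std::memory_order_release);
    }
};
```

Unlike the seqlock, the reader waits for exclusive access before touching the data, so there is never a racing access to the plain int.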
Note that your original seqlock is buggy: you would not want to just check seq0 != seq1, since you could be reading in the middle of an update (seq0 and seq1 equal but odd). The seqlock paper has the correct condition.
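A sketch of that corrected retry condition, assuming the usual seqlock convention where a writer bumps seq to an odd value before writing and to an even value after (the globals are defined rather than extern here so the snippet stands alone):

```cpp
#include <atomic>

std::atomic<unsigned> seq{0};
std::atomic<int> data{0};

int reader() {
    int data_copy;
    unsigned seq0;
    unsigned seq1;
    do {
        seq0 = seq.load(std::memory_order_acquire);
        data_copy = data.load(std::memory_order_relaxed);
        std::atomic_thread_fence(std::memory_order_acquire);
        seq1 = seq.load(std::memory_order_relaxed);
        // Retry while a write is in progress (seq0 odd) or a
        // write completed during the read (seq0 != seq1).
    } while ((seq0 & 1) != 0 || seq0 != seq1);
    return data_copy;
}
```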