I'm trying to demonstrate that it's very bad idea to not use std::atomic<>
s but I can't manage to create an example that reproduces the failure. I have two threads and one of them does:
{
foobar = false;
}
and the other:
{
if (foobar) {
// ...
}
}
the type of foobar
is either bool
or std::atomic_bool
and it's initialized to true
. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc. and in any case, there's no observable difference in execution. I end up inspecting generated assembly with clang clang -std=c++11 -lstdc++ -O3 -S test.cpp
and I see that the asm differences on read are minor (without atomic on left, with on right):
No mfence
or something that "dramatic". On the write side, something more "dramatic" happens:
As you can see, the atomic<>
version uses xchgb
which uses an implicit lock. When I compile with a relatively old version of gcc (v4.5.2) I can see all sorts of mfence
s being added which also indicates there's a serious concern.
I kind of understand that "X86 implements a very strong memory model" (ref) and that mfence
s might not be necessary but does it mean that unless I want to write cross-platform code that e.g. supports ARM, I don't really need to put any atomic<>
s unless I care for consistency at ns-level?
I've watched "atomic<> Weapons" from Herb Sutter but I'm still impressed with how difficult it is to create a simple example that reproduces those problems.
Kernel threads are supported directly by the operating system. Any application can be programmed to be multithreaded. All of the threads within an application are supported within a single process.
C++ 11 did away with all that and gave us std::thread. The thread classes and related functions are defined in the thread header file. std::thread is the thread class that represents a single thread in C++.
Only one thread can access the Kernel at a time, so multiple threads are unable to run in parallel on multiprocessors. If the user-level thread libraries are implemented in the operating system in such a way that the system does not support them, then the Kernel threads use the many-to-one relationship modes.
Threads are implemented in following two ways − User Level Threads − User managed threads. Kernel Level Threads − Operating System managed threads acting on kernel, an operating system core.
The big problem of data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproduceable failures.
A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.
Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.
bool operation_done;
void thread1() {
while (!operation_done) {
sleep();
}
// do something that depends on operation being done
}
void thread2() {
// do the operation
operation_done = true;
}
This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.
The problem was that the optimizer realized that under its memory model, operation_done
could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true once, it would be true forever.
Changing the type of operation_done to atomic_bool
(or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit + kudos to @HansPassant for his comment which brought my attention back to writes which made everything clear - since as soon as I observed that the compiler was adding synchronization on writes, the problem turned to be that it wasn't optimizing bool
as much as one would expect.
I was able to have a trivial program that reproduces the problem:
std::atomic_bool foobar(true);
//bool foobar = true;
long long cnt = 0;
long long loops = 400000000ll;
void thread_1() {
usleep(200000);
foobar = false;
}
void thread_2() {
while (loops--) {
if (foobar) {
++cnt;
}
}
std::cout << cnt << std::endl;
}
The main difference with my original code was that I used to have a usleep()
inside the while
loop. It was enough to prevent any optimizations within the while
loop. The cleaner code above, yields the same asm for write:
but quite different for read:
We can see that in the bool
case (left) clang brought the if (foobar)
outside the loop. Thus when I run the bool
case I get:
400000000
real 0m1.044s
user 0m1.032s
sys 0m0.005s
while when I run the atomic_bool
case I get:
95393578
real 0m0.420s
user 0m0.414s
sys 0m0.003s
It's interesting that the atomic_bool
case is faster - I guess because it does just 95 million inc
s on the counter contrary to 400 million in the bool
case.
What is even more crazy-interesting though is this. If I move the std::cout << cnt << std::endl;
out of the threaded code, after pthread_join()
, the loop in the non-atomic case becomes just this:
i.e. there's no loop. It's just if (foobar!=0) cnt = loops;
! Clever clang. Then the execution yields:
400000000
real 0m0.206s
user 0m0.001s
sys 0m0.002s
while the atomic_bool
remains the same.
So more than enough evidence that we should use atomic
s. The only thing to remember is - don't put any usleep()
on your benchmarks because even if it's small, it will prevent quite a few compiler optimizations.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With