Are atomic types necessary in multi-threading? (OS X, clang, c++11)

Tags: c++, multithreading, gcc, c++11, clang

I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads, and one of them does:

{
    foobar = false;
}

and the other:

{
    if (foobar) {
        // ...
    }
}

The type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc., and in every case there's no observable difference in execution. I ended up inspecting the assembly generated by clang (clang -std=c++11 -lstdc++ -O3 -S test.cpp), and I see that the asm differences on the read side are minor (without atomic on the left, with atomic on the right):

[image: read-side assembly, https://i.stack.imgur.com/KN0q2.png]

No mfence or something that "dramatic". On the write side, something more "dramatic" happens:

[image: write-side assembly, https://i.stack.imgur.com/f7EOe.png]

As you can see, the atomic<> version uses xchgb which uses an implicit lock. When I compile with a relatively old version of gcc (v4.5.2) I can see all sorts of mfences being added which also indicates there's a serious concern.

I kind of understand that "X86 implements a very strong memory model" (ref) and that mfences might not be necessary, but does it mean that, unless I want to write cross-platform code that also supports e.g. ARM, I don't really need any atomic<>s unless I care about consistency at the nanosecond level?

I've watched "atomic<> Weapons" from Herb Sutter, but I'm still surprised by how difficult it is to create a simple example that reproduces those problems.

asked Sep 10 '16 18:09

neverlastn

2 Answers

The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. This, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.

A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one code line.

Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.

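bool operation_done;  // plain bool shared between threads: this is the bug

void thread1() {
  while (!operation_done) {  // unsynchronized read; the optimizer may hoist it
    sleep();
  }
  // do something that depends on operation being done
}

void thread2() {
  // do the operation
  operation_done = true;  // unsynchronized write: data race with thread1
}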

This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.

The problem was that the optimizer realized that, under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition was true, it would be true forever.

Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
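For illustration, here is a minimal C++11 sketch of what the fixed version could look like. It is not the original production code: the names waiter, worker, result and run_spin_wait_demo are hypothetical, and std::this_thread::sleep_for stands in for the unspecified sleep() call.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

std::atomic<bool> operation_done{false};
int result = 0;  // hypothetical payload produced by "the operation"

void waiter() {
    // Atomic load: the compiler must assume another thread may change
    // the flag, so the condition cannot be dropped from the loop.
    while (!operation_done.load()) {
        std::this_thread::sleep_for(std::chrono::milliseconds(1));
    }
    // The store in worker() happens-before this point, so result is visible.
    assert(result == 42);
}

void worker() {
    result = 42;                 // do the operation
    operation_done.store(true);  // publish completion
}

// Runs both threads to completion and returns the payload.
int run_spin_wait_demo() {
    operation_done.store(false);
    result = 0;
    std::thread t1(waiter);
    std::thread t2(worker);
    t1.join();
    t2.join();
    return result;
}
```

The default load() and store() are sequentially consistent, which is stronger than this pattern strictly needs, but it keeps the sketch simple and is what a plain atomic_bool read or assignment gives you.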

answered Nov 15 '22 11:11

Sebastian Redl

This is my own version of @Sebastian Redl's answer that fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant for his comment, which brought my attention back to the writes and made everything clear: as soon as I observed that the compiler was adding synchronization on writes, the problem turned out to be that it wasn't optimizing the bool case as much as one would expect.

I was able to write a trivial program that reproduces the problem:

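#include <atomic>
#include <iostream>
#include <unistd.h>  // usleep

std::atomic_bool foobar(true);
//bool foobar = true;  // swap with the line above to test the non-atomic case

long long cnt = 0;
long long loops = 400000000ll;

// Writer: clears the flag after 200 ms.
void thread_1() {
    usleep(200000);
    foobar = false;
}

// Reader: counts the iterations in which the flag was still set.
void thread_2() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}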

The main difference from my original code is that I used to have a usleep() inside the while loop. That was enough to prevent any optimizations within the loop. The cleaner code above yields the same asm for the write:

[image: write-side assembly, https://i.stack.imgur.com/NDei4.png]

but quite different for read:

[image: read-side assembly, https://i.stack.imgur.com/OYYBp.png]

We can see that in the bool case (left) clang brought the if (foobar) outside the loop. Thus when I run the bool case I get:
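To make the transformation concrete, here is a source-level sketch of the difference. It is only an approximation of what the assembly suggests, not the compiler's actual output, and the function names count_plain_as_optimized and count_atomic are made up for this illustration.

```cpp
#include <atomic>
#include <cassert>

// Roughly what the optimizer may turn the plain-bool loop into: the flag
// is read once, before the loop, because a data-race-free program could
// not change it while the loop is running.
long long count_plain_as_optimized(const bool& flag, long long loops) {
    assert(loops >= 0);
    long long cnt = 0;
    const bool snapshot = flag;  // single read, hoisted out of the loop
    while (loops--) {
        if (snapshot) ++cnt;
    }
    return cnt;
}

// With std::atomic<bool> the source asks for a load on every iteration,
// and in practice current compilers keep it there.
long long count_atomic(const std::atomic<bool>& flag, long long loops) {
    assert(loops >= 0);
    long long cnt = 0;
    while (loops--) {
        if (flag.load()) ++cnt;
    }
    return cnt;
}
```

Called from a single thread, both functions return the same count. They only diverge when another thread clears the flag mid-loop: the hoisted version never notices, which matches the 400000000 result of the bool case.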

400000000

real    0m1.044s
user    0m1.032s
sys     0m0.005s

while when I run the atomic_bool case I get:

95393578

real    0m0.420s
user    0m0.414s
sys     0m0.003s

It's interesting that the atomic_bool case is faster. I guess that's because it does just 95 million incs on the counter, as opposed to 400 million in the bool case.

What is even more crazy-interesting though is this. If I move the std::cout << cnt << std::endl; out of the threaded code, after pthread_join(), the loop in the non-atomic case becomes just this:

[image: non-atomic read-side assembly without a loop, https://i.stack.imgur.com/wBKyr.png]

i.e. there's no loop. It's just if (foobar!=0) cnt = loops;! Clever clang. Then the execution yields:

400000000

real    0m0.206s
user    0m0.001s
sys     0m0.002s

while the atomic_bool remains the same.

So there's more than enough evidence that we should use atomics. The only thing to remember is: don't put any usleep() in your benchmarks, because even a small one will prevent quite a few compiler optimizations.
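For anyone who wants to rerun the atomic variant of the experiment, a self-contained harness might look like the sketch below. It is an assumption-laden rewrite, not the program I actually measured: it uses std::thread instead of pthreads, keeps the state local, and the function name run_counter and its parameters are invented for this example.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Returns how many of the `loops` iterations still saw the flag set.
// The writer thread clears the flag after `delay`.
long long run_counter(long long loops, std::chrono::microseconds delay) {
    assert(loops >= 0);
    std::atomic<bool> foobar{true};
    long long cnt = 0;

    std::thread writer([&] {
        std::this_thread::sleep_for(delay);
        foobar = false;
    });
    std::thread reader([&] {
        for (long long i = 0; i < loops; ++i) {
            if (foobar) ++cnt;  // atomic load on each iteration
        }
    });

    writer.join();
    reader.join();
    return cnt;  // safe to read: join() synchronizes with thread completion
}
```

Calling run_counter(400000000LL, std::chrono::microseconds(200000)) mirrors the numbers used above. The returned count depends on timing and will differ from run to run, but it can never exceed the loop count.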

answered Nov 15 '22 11:11

neverlastn
