Are atomic types necessary in multi-threading? (OS X, clang, c++11)

I'm trying to demonstrate that it's a very bad idea not to use std::atomic<>s, but I can't manage to create an example that reproduces the failure. I have two threads, and one of them does:

{
    foobar = false;
}

and the other:

{
    if (foobar) {
        // ...
    }
}

The type of foobar is either bool or std::atomic_bool, and it's initialized to true. I'm using OS X Yosemite and even tried to use this trick to hint via CPU affinity that I want the threads to run on different cores. I run such operations in loops etc., and in any case there's no observable difference in execution. I ended up inspecting the generated assembly produced with clang -std=c++11 -lstdc++ -O3 -S test.cpp, and I see that the asm differences on the read side are minor (without atomic on the left, with atomic on the right):

[screenshot: read-side assembly, plain bool on the left, std::atomic_bool on the right]

No mfence or anything that "dramatic". On the write side, something more "dramatic" happens:

[screenshot: write-side assembly, plain bool on the left, std::atomic_bool on the right]

As you can see, the atomic<> version uses xchgb, which carries an implicit lock. When I compile with a relatively old version of gcc (v4.5.2), I see all sorts of mfences being added, which also indicates that there's a serious concern.
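To make the comparison concrete, here is a minimal reduction of the two flavours I'm comparing (my own sketch, with illustrative function names); the comments describe the typical codegen I observe with clang -O3 on x86-64:

#include <atomic>

bool             foobar_plain = true;
std::atomic_bool foobar_atomic(true);

void write_plain()  { foobar_plain  = false; }   // plain movb
void write_atomic() { foobar_atomic = false; }   // xchgb: an implicitly locked read-modify-write

bool read_plain()   { return foobar_plain; }     // plain movb
bool read_atomic()  { return foobar_atomic; }    // also essentially a plain movb on x86-64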

I kind of understand that "x86 implements a very strong memory model" (ref) and that mfences might not be necessary, but does that mean that, unless I want to write cross-platform code that e.g. supports ARM, I don't really need any atomic<>s unless I care about consistency at the nanosecond level?

I've watched Herb Sutter's "atomic<> Weapons" talk, but I'm still struck by how difficult it is to create a simple example that reproduces those problems.

asked Sep 10 '16 by neverlastn



2 Answers

The big problem with data races is that they're undefined behavior, not guaranteed wrong behavior. And this, in conjunction with the general unpredictability of threads and the strength of the x64 memory model, means that it gets really hard to create reproducible failures.

A slightly more reliable failure mode is when the optimizer does unexpected things, because you can observe those in the assembly. Of course, the optimizer is notoriously finicky as well and might do something completely different if you change just one line of code.

Here's an example failure that we had in our code at one point. The code implemented a sort of spin lock, but didn't use atomics.

#include <unistd.h>   // for sleep()

bool operation_done;  // note: not std::atomic<bool>; that's the bug being demonstrated

void thread1() {
  // Spin until the other thread signals that the operation is done.
  while (!operation_done) {
    sleep(1);
  }
  // do something that depends on the operation being done
}

void thread2() {
  // do the operation
  operation_done = true;
}

This worked fine in debug mode, but the release build got stuck. Debugging showed that execution of thread1 never left the loop, and looking at the assembly, we found that the condition was gone; the loop was simply infinite.

The problem was that the optimizer realized that, under its memory model, operation_done could not possibly change within the loop (that would have been a data race), and thus it "knew" that once the condition held, it would hold forever.
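Roughly speaking, the transformation looks like this at the source level (an illustrative reconstruction, not actual compiler output):

void thread1_as_optimized() {
  if (!operation_done) {   // the flag is read exactly once...
    for (;;) {
      sleep(1);            // ...so the loop can never terminate
    }
  }
  // do something that depends on the operation being done
}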

Changing the type of operation_done to atomic_bool (or actually, a pre-C++11 compiler-specific equivalent) fixed the issue.
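With a C++11 compiler, that fix looks simply like this (a sketch; as noted, our actual fix used a pre-C++11 compiler-specific equivalent):

#include <atomic>
#include <unistd.h>

std::atomic<bool> operation_done(false);

void thread1() {
  while (!operation_done) {   // now an atomic load on every iteration,
    sleep(1);                 // which the optimizer may not hoist out
  }
  // do something that depends on the operation being done
}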

answered Nov 15 '22 by Sebastian Redl


This is my own version of @Sebastian Redl's answer, which fits the question more closely. I will still accept his for credit, plus kudos to @HansPassant, whose comment brought my attention back to the writes and made everything clear: as soon as I observed that the compiler was adding synchronization on the writes, the problem turned out to be that it wasn't optimizing the bool version as much as one would expect.

I was able to have a trivial program that reproduces the problem:

#include <atomic>
#include <iostream>
#include <thread>
#include <unistd.h>

std::atomic_bool foobar(true);
//bool foobar = true;

long long cnt = 0;
long long loops = 400000000ll;

void thread_1() {
    usleep(200000);          // give thread_2 a head start, then flip the flag
    foobar = false;
}

void thread_2() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
    }
    std::cout << cnt << std::endl;
}

// Minimal harness, assuming the two functions run as two threads and are
// joined at the end (the text below refers to pthread_join(); std::thread
// wraps pthreads on OS X).
int main() {
    std::thread t2(thread_2), t1(thread_1);
    t1.join();
    t2.join();
}

The main difference from my original code is that I used to have a usleep() inside the while loop, which was enough to prevent any optimizations within the loop. The cleaner code above yields the same asm for the write:

[screenshot: write-side assembly, unchanged from the earlier comparison]

but quite different for read:

[screenshot: read-side assembly, plain bool on the left, std::atomic_bool on the right]

We can see that in the bool case (left), clang hoisted the if (foobar) check outside the loop. Thus, when I run the bool case I get:

400000000

real    0m1.044s
user    0m1.032s
sys 0m0.005s

while when I run the atomic_bool case I get:

95393578

real    0m0.420s
user    0m0.414s
sys 0m0.003s

It's interesting that the atomic_bool case is faster; I guess that's because it does only ~95 million increments of the counter, as opposed to 400 million in the bool case.
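In source terms, the bool build behaves roughly as if clang had rewritten thread_2 like this (an illustrative reconstruction based on the assembly, not actual compiler output):

void thread_2_as_compiled_for_bool() {
    if (foobar) {                 // the flag is read once, before the loop
        while (loops--) {
            ++cnt;                // thread_1's later write is never observed
        }
    } else {
        while (loops--) { }
    }
    std::cout << cnt << std::endl;
}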

What is even more interesting, though, is this: if I move the std::cout << cnt << std::endl; out of the threaded code, to after pthread_join(), the loop in the non-atomic case becomes just this:

[screenshot: assembly of the non-atomic loop, reduced to a single conditional assignment]

i.e. there's no loop. It's just if (foobar!=0) cnt = loops;! Clever clang. Then the execution yields:

400000000

real    0m0.206s
user    0m0.001s
sys 0m0.002s

while the atomic_bool case remains the same.
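For completeness, my original loop looked roughly like this (the exact interval doesn't matter). The usleep() is an opaque library call, so clang has to assume it might modify the externally visible foobar and must reload the flag on every iteration, which is why the bool and atomic_bool builds behaved the same:

void thread_2_original() {
    while (loops--) {
        if (foobar) {
            ++cnt;
        }
        usleep(1);   // opaque call: forces foobar to be re-read each iteration
    }
    std::cout << cnt << std::endl;
}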

So there's more than enough evidence that we should use atomics. The only thing to remember is: don't put any usleep() in your benchmarks, because even a small one will prevent quite a few compiler optimizations.

answered Nov 15 '22 by neverlastn