Is std::call_once lock free?

Question

I want to find out if the std::call_once lock free. There are call_once implementations using mutex. But why should we use mutex? I tried to write simple implementation using atomic_bool and CAS operation. Is the code thread safe?

#include <iostream>
#include <thread>
#include <atomic>
#include <unistd.h>

using namespace std;
using my_once_flag = atomic<bool>;

void my_call_once(my_once_flag& flag, std::function<void()> foo) {
    bool expected = false;
    bool res = flag.compare_exchange_strong(expected, true,
                                            std::memory_order_release, std::memory_order_relaxed);
    if(res)
        foo();
}
my_once_flag flag;
void printOnce() {
    usleep(100);
    my_call_once(flag, [](){
       cout << "test" << endl;
    });
}

int main() {
    for(int i = 0; i< 500; ++i){
            thread([](){
                printOnce();
            }).detach();
    }
    return 0;
}

Pete Becker · Accepted Answer

Your proposed implementation is not thread safe. It does guarantee that foo() will only be called once through this code, but it does not guarantee that all threads will see side effects from the call to foo(). Suppose that thread 1 executes the compare and gets true, then the scheduler switches to thread 2, before thread 2 calls foo(). Thread 2 will get false, skip the call to foo(), and move on. Since the call to foo() has not been executed, thread 2 can continue executing before any side effects from foo() have occurred.

Peter Cordes · Answer

The already-called-once fast-path can be wait-free.

gcc's implementation doesn't look all that efficient. I don't know why it isn't implemented the same way as initialization of static local variables with a non-constant arg, which uses a check that's very cheap (but not free!) for the case where it's already initialized.

http://en.cppreference.com/w/cpp/thread/call_once comments that:

Initialization of function-local statics is guaranteed to occur only once even when called from multiple threads, and may be more efficient than the equivalent code using std::call_once.

To make an efficient implementation, the std::once_flag could have three states:

execution finished: if you find this state, you're already done.
execution in progress: if you find this: wait until it changes to finished (or changes to failed-with-exception, which which case try to claim it)
execution not started: if you find this, attempt to CAS it to in-progress and call the function. If the CAS failed, some other thread succeeded, so goto the wait-for-finished state.

Checking the flag with an acquire-load is extremely cheap on most architectures (especially x86 where all loads are acquire-loads). Once it's set to "finished", it's not modified for the rest of the program, so it can stay cached in L1 on all cores (unless you put it in the same cache line as something that is frequently modified, creating false sharing).

Even if your implementation worked, it attempts an atomic CAS every time, which is ridiculously more expensive than a load-acquire.

I haven't fully decoded what gcc is doing for call_once, but it unconditionally does a bunch of loads, and two stores to thread-local storage, before checking if a pointer is NULL. (test rax,rax / je). But if it is, it calls std::__throw_system_error(int), so that's not a guard variable that it's using to detect the already-initialized case.

So it looks like it unconditionally calls __gthrw_pthread_once(int*, void (*)()), and checks the return value. So that pretty much sucks for use-cases where you want to cheaply make sure some initialization got done, while avoiding the static-initialization fiasco. (i.e. that your build procedure controls the ordering of constructors for static objects, not anything you put in the code itself.)

So I'd recommend using static int dummy = init_function(); where dummy is either something you actually want to construct, or just a way to call init_function for its side-effects.

Then on the fast-path, the asm from:

int called_once();

void static_local(){
  static char dummy = called_once();
  (void)dummy;
}

looks like this:

static_local():
    movzx   eax, BYTE PTR guard variable for static_local()::dummy[rip]
    test    al, al
    je      .L18
    ret
 .L18:
    ... # code that implements basically what I described above: call or wait

See it on the Godbolt compiler explorer, along with gcc's actual code for std::once_flag.

You could of course implement a guard variable yourself with an atomic uint8_t, which starts out initialized to non-zero, and is set to zero only when the call is complete. Testing for zero may be slightly cheaper on some ISAs, including x86 if the compiler is weird like gcc and decides to actually load it into a register instead of using cmp byte [guard], 0.

Is std::call_once lock free?

Tags:

c++

multithreading

c++11

lock-free

NoTrust

2 Answers

Pete Becker

Peter Cordes

Recent Activity

Donate For Us

Is std::call_once lock free?

Tags:

c++

multithreading

c++11

lock-free

NoTrust

2 Answers

Pete Becker

Peter Cordes

Related questions

Recent Activity

Donate For Us