I want to find out if the std::call_once lock free. There are call_once implementations using mutex. But why should we use mutex? I tried to write simple implementation using atomic_bool and CAS operation. Is the code thread safe?
#include <iostream>
#include <thread>
#include <atomic>
#include <unistd.h>
using namespace std;
using my_once_flag = atomic<bool>;
void my_call_once(my_once_flag& flag, std::function<void()> foo) {
bool expected = false;
bool res = flag.compare_exchange_strong(expected, true,
std::memory_order_release, std::memory_order_relaxed);
if(res)
foo();
}
my_once_flag flag;
void printOnce() {
usleep(100);
my_call_once(flag, [](){
cout << "test" << endl;
});
}
int main() {
for(int i = 0; i< 500; ++i){
thread([](){
printOnce();
}).detach();
}
return 0;
}
Your proposed implementation is not thread safe. It does guarantee that foo()
will only be called once through this code, but it does not guarantee that all threads will see side effects from the call to foo()
. Suppose that thread 1 executes the compare and gets true, then the scheduler switches to thread 2, before thread 2 calls foo()
. Thread 2 will get false, skip the call to foo()
, and move on. Since the call to foo()
has not been executed, thread 2 can continue executing before any side effects from foo()
have occurred.
The already-called-once fast-path can be wait-free.
gcc's implementation doesn't look all that efficient. I don't know why it isn't implemented the same way as initialization of static
local variables with a non-constant arg, which uses a check that's very cheap (but not free!) for the case where it's already initialized.
http://en.cppreference.com/w/cpp/thread/call_once comments that:
Initialization of function-local statics is guaranteed to occur only once even when called from multiple threads, and may be more efficient than the equivalent code using std::call_once.
To make an efficient implementation, the std::once_flag
could have three states:
Checking the flag with an acquire-load is extremely cheap on most architectures (especially x86 where all loads are acquire-loads). Once it's set to "finished", it's not modified for the rest of the program, so it can stay cached in L1 on all cores (unless you put it in the same cache line as something that is frequently modified, creating false sharing).
Even if your implementation worked, it attempts an atomic CAS every time, which is ridiculously more expensive than a load-acquire.
I haven't fully decoded what gcc is doing for call_once
, but it unconditionally does a bunch of loads, and two stores to thread-local storage, before checking if a pointer is NULL. (test rax,rax / je
). But if it is, it calls std::__throw_system_error(int)
, so that's not a guard variable that it's using to detect the already-initialized case.
So it looks like it unconditionally calls __gthrw_pthread_once(int*, void (*)())
, and checks the return value. So that pretty much sucks for use-cases where you want to cheaply make sure some initialization got done, while avoiding the static-initialization fiasco. (i.e. that your build procedure controls the ordering of constructors for static objects, not anything you put in the code itself.)
So I'd recommend using static int dummy = init_function();
where dummy is either something you actually want to construct, or just a way to call init_function
for its side-effects.
Then on the fast-path, the asm from:
int called_once();
void static_local(){
static char dummy = called_once();
(void)dummy;
}
looks like this:
static_local():
movzx eax, BYTE PTR guard variable for static_local()::dummy[rip]
test al, al
je .L18
ret
.L18:
... # code that implements basically what I described above: call or wait
See it on the Godbolt compiler explorer, along with gcc's actual code for std::once_flag
.
You could of course implement a guard variable yourself with an atomic uint8_t, which starts out initialized to non-zero, and is set to zero only when the call is complete. Testing for zero may be slightly cheaper on some ISAs, including x86 if the compiler is weird like gcc and decides to actually load it into a register instead of using cmp byte [guard], 0
.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With