Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic like std::atomic_uint64_t in a thread-safe fashion without using an extra mutex or the like. In addition, I also need to know which bit got cleared.
Example:
Let's say the current value stored is 0b0110. I want to know that the lowest non-zero bit is bit 1 (0-indexed) and set the variable to 0b0100.
The best version I came up with is this:
#include <atomic>
#include <cstdint>
inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return (v-1) & v;
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);
}
uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected, with_lowest_non_zero_cleared(expected),
                                     std::memory_order_seq_cst, std::memory_order_relaxed))
        {;}
    return only_keep_lowest_non_zero(expected);
}
Is there a better solution?
Notes:
There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support¹, it doesn't let you do anything fundamentally different.
The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit when a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add, where there's no retrying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)
The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.
You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters).
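A minimal sketch of that instrumentation (names made up): count retries in a cheap local, and only touch the shared counter with a relaxed RMW when contention actually happened.

```cpp
#include <atomic>
#include <cstdint>

inline std::uint64_t clear_lowest_set(std::uint64_t v) { return (v - 1) & v; }

std::atomic<std::uint64_t> g_retry_count{0};    // shared contention statistic

std::uint64_t pop_lowest_instrumented(std::atomic<std::uint64_t>& foo)
{
    unsigned retries = 0;                       // cheap local counter
    auto expected = foo.load(std::memory_order_relaxed);
    while (!foo.compare_exchange_weak(expected, clear_lowest_set(expected),
                                      std::memory_order_seq_cst,
                                      std::memory_order_relaxed))
        ++retries;
    if (retries)   // publish only when the CAS actually failed at least once
        g_retry_count.fetch_add(retries, std::memory_order_relaxed);
    return expected & ~clear_lowest_set(expected);  // isolate lowest set bit
}
```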
Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.
EDIT: The update has to be atomic and global progress must be guaranteed, but just as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).
A CAS retry loop (even when compiled on an LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.
CAS leaves the cache line unmodified if it fails. On x86 it dirties the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so one other thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock. That's something you can assume CPU architects thought of.
It's not wait-free because a single thread can in theory have to retry an unbounded number of times.
Your code compiles with gcc8.1 -O3 -march=haswell to this asm (from the Godbolt compiler explorer):
# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                    # Bit Lowest Set Reset
    lock cmpxchg QWORD PTR [rdi], rdx
    jne     .L2                         # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                    # Bit Lowest Set Isolate
    ret
Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.
clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see How do the likely/unlikely macros in the Linux kernel work and what is their benefit?)
Footnote 1: Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation, even in the hot-in-L1d-cache no-contention case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop with blsr instead of 2 other instructions. Assuming the front-end is the bottleneck, rather than memory or something else. Being a full memory barrier has an impact on loads/stores in surrounding code, too, but fortunately not on out-of-order execution of independent ALU instructions.)
Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to an LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)
So you just have to write code using compare_exchange_weak if C++ doesn't provide the operation you want (e.g. fetch_or or fetch_and).
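That pattern can be wrapped in a small helper; this is a sketch, not a standard API (std::atomic has no fetch_update member):

```cpp
#include <atomic>

// Atomically replaces a's value v with f(v) and returns the old value.
// f may be called multiple times (once per CAS retry), so it must be pure.
template <class T, class F>
T fetch_update(std::atomic<T>& a, F f)
{
    T expected = a.load(std::memory_order_relaxed);
    while (!a.compare_exchange_weak(expected, f(expected),
                                    std::memory_order_seq_cst,
                                    std::memory_order_relaxed))
        ;  // a failed CAS reloaded expected; just retry
    return expected;
}

// e.g. the clear-lowest-set-bit operation from the question:
// std::uint64_t old = fetch_update(foo, [](std::uint64_t v){ return (v-1) & v; });
```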
What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.
The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the load result (it depends on the previous load result). Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out-of-order execution they can still be ready ahead of time.
But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't, because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a difference for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.
I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().
This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.
#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif

inline uint64_t clear_lowest_set(std::uint64_t v){
    return (v-1) & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
    //return v & ~clear_lowest_set(v);
    return (-v) & v;
    // MSVC optimizes this better for ARM when used separately.
    // The other way (in terms of clear_lowest_set) does still CSE to 2 instructions
    // when the clear_lowest_set result is already available.
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(unlikely(!foo.compare_exchange_weak(
        expected, clear_lowest_set(expected),
        std::memory_order_seq_cst, std::memory_order_relaxed)))
        {}
    return isolate_lowest_set(expected);
}
Clang -O3 for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57 on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.
pop_lowest_non_zero(std::__1::atomic<unsigned long long>&): // @pop_lowest_non_zero(std::__1::atomic<unsigned long long>&)
    ldr     x9, [x0]                @ the relaxed load
    ldaxr   x8, [x0]                @ the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .LBB0_3
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest(relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .LBB0_4            @ if(SC failed) goto retry loop
        # else fall through and eventually reach the epilogue
        # this looks insane.  w10 = 0|1, then branch if bit[0] != 0.  Always taken, right?
    orr     w10, wzr, #0x1
    tbnz    w10, #0, .LBB0_5        @ test-bit not-zero will always jump to the epilogue
    b       .LBB0_6                 # never reached
.LBB0_3:
    clrex                           @ clear the ldrx/stxr transaction
.LBB0_4:
        # this block is pointless; should just b to .LBB0_6
    mov     w10, wzr
    tbz     w10, #0, .LBB0_6        # go to the retry loop, below the ret (not shown here)
.LBB0_5:                            # isolate_lowest_set and return
    neg     x8, x9
    and     x0, x9, x8
    ret
.LBB0_6:
        @ the retry loop.  Only reached if the compare or SC failed.
    ...
All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173.
It seems MSVC doesn't know that AArch64's release-stores already have sequentially-consistent semantics wrt. ldar / ldaxr loads (not just regular release like x86): it's still using a dmb ish instruction (full memory barrier) after stlxr. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak compiles like compare_exchange_strong, with a probably-useless retry loop that will try just the CAS again with the old expected/desired on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new. That might explain the dmb.)
## MSVC Pre-2018 -Ox
|pop_lowest_non_zero| PROC
    ldr     x10,[x0]                # x10 = expected = foo.load(relaxed)
|$LL2@pop_lowest|                   # top of the while() loop
    sub     x8,x10,#1
    and     x11,x8,x10              # clear_lowest(relaxed load result)
|$LN76@pop_lowest|
    ldaxr   x8,[x0]
    cmp     x8,x10                  # the compare part of CAS
    bne     |$LN77@pop_lowest|
    stlxr   w9,x11,[x0]
    cbnz    w9,|$LN76@pop_lowest|   # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
        # fall through on LL/SC success
|$LN77@pop_lowest|
    dmb     ish                     # full memory barrier: slow
    cmp     x8,x10                  # compare again: apparently MSVC wants to share the dmb between the CAS-fail and CAS-success paths
    beq     |$LN75@pop_lowest|      # if(expected matches) goto epilogue
    mov     x10,x8                  # else update expected
    b       |$LL2@pop_lowest|       # and goto the top of the while loop
|$LN75@pop_lowest|                  # function epilogue
    neg     x8,x10
    and     x0,x8,x10
    ret
gcc6.3 for AArch64 makes weird code, too, storing expected to the stack. (Godbolt doesn't have newer AArch64 gcc.)
I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.
pop_lowest_non_zero:                ## hand-written based on clang's output
        # clang could emit this even without turning the CAS into LL/SC directly
    ldr     x9, [x0]                @ x9 = expected = foo.load(relaxed)
.Lcas_retry:
    ldaxr   x8, [x0]                @ x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                  @ compare part of the CAS
    b.ne    .Lcas_fail
    sub     x10, x9, #1
    and     x10, x10, x9            @ clear_lowest(relaxed load result)
    stlxr   w11, x10, [x0]          @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail
        # LL/SC success: isolate_lowest_set and return
.Lepilogue:
    neg     x8, x9
    and     x0, x9, x8
    ret
.Lcas_fail:
    // clrex                        @ we're about to start another ldaxr, making clrex unnecessary
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b       .Lcas_retry             @ instead of duplicating the loop, jump back to the existing one with x9 = expected
fetch_add: Clang does make nice code for fetch_add():
    mov     x8, x0
.LBB1_1:
    ldxr    x0, [x8]                # relaxed exclusive load: I used mo_release
    add     x9, x0, #1
    stlxr   w10, x9, [x8]
    cbnz    w10, .LBB1_1            # retry if LL/SC failed
    ret
Instead of add #1, we'd like to have sub x9, x0, #1 / and x9, x9, x0, so we just get an LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.
Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp=1ULL << counter; tmp ^= tmp-1; (i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 bit set at position 0..63) / blsmsk rax, rax (rax = all bits set up to that position)). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.

Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << counter) - 1, again for counter < 64.
Or maybe it has to be a bitmask, but doesn't need to be one single bitmask?
No idea what your use-case is, but this kind of idea might be useful:
Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
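A sketch of that chunked layout (all names hypothetical): each chunk padded to its own cache line, with each thread scanning from its preferred chunk first.

```cpp
#include <atomic>
#include <cstdint>
#include <utility>

inline std::uint64_t clear_lowest_set(std::uint64_t v) { return (v - 1) & v; }

struct alignas(64) PaddedMask {            // one chunk per (typical) cache line
    std::atomic<std::uint64_t> bits{0};
};

constexpr int kChunks = 4;
PaddedMask g_chunks[kChunks];

// Pop the lowest set bit, starting at this thread's preferred chunk and
// falling back to the others. Returns {chunk index, single-bit mask},
// or {-1, 0} if every chunk is empty.
std::pair<int, std::uint64_t> pop_any(int preferred)
{
    for (int i = 0; i < kChunks; ++i) {
        int c = (preferred + i) % kChunks;   // our own chunk first: less contention
        auto expected = g_chunks[c].bits.load(std::memory_order_relaxed);
        while (expected != 0) {
            if (g_chunks[c].bits.compare_exchange_weak(
                    expected, clear_lowest_set(expected),
                    std::memory_order_seq_cst, std::memory_order_relaxed))
                return {c, expected & (0 - expected)};  // isolate lowest set bit
            // on CAS failure, expected was reloaded; retry (or move on if now empty)
        }
    }
    return {-1, 0};
}
```

Note the trade-off stated above still applies: there is no way to atomically snapshot all kChunks chunks at once.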
In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg on it. That's not actually helpful for this case, because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg to succeed when other bytes of the qword have changed.