 

Atomically clearing lowest non-zero bit of an unsigned integer

Question:
I'm looking for the best way to clear the lowest non-zero bit of an unsigned atomic like std::atomic_uint64_t in a thread-safe fashion without using an extra mutex or the like. In addition, I also need to know which bit got cleared.

Example: Let's say the current value stored is 0b0110. I want to know that the lowest non-zero bit is bit 1 (0-indexed) and set the variable to 0b0100.

The best version I came up with is this:

#include <atomic>
#include <cstdint>

inline uint64_t with_lowest_non_zero_cleared(std::uint64_t v){
    return (v - 1) & v;    // clear the lowest set bit
}
inline uint64_t only_keep_lowest_non_zero(std::uint64_t v){
    return v & ~with_lowest_non_zero_cleared(v);    // isolate the lowest set bit
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(!foo.compare_exchange_weak(expected,with_lowest_non_zero_cleared(expected), std::memory_order_seq_cst, std::memory_order_relaxed))
    {;}
    return only_keep_lowest_non_zero(expected);
}

Is there a better solution?

Notes:

  • It doesn't have to be the lowest non-zero bit - I'd also be happy with a solution that clears the highest bit or sets the highest/lowest "zero bit" if that makes a difference
  • I'd very much prefer a standard C++17 version, but would accept an answer that uses intrinsics for clang and MSVC
  • It should be portable (compiling with clang or MSVC for x64 and AArch64), but I'm most interested in the performance on recent Intel and AMD x64 architectures when compiled with clang.
  • EDIT: The update has to be atomic and global progress must be guaranteed, but just as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).
asked Jul 15 '18 by MikeMB

1 Answer

There is no direct hardware support for atomic clear-lowest-bit on x86. BMI1 blsr is only available in memory-source form, not memory-destination form; lock blsr [shared_var] does not exist. (Compilers know how to optimize (var-1) & (var) into blsr for local vars when you compile with -march=haswell or otherwise enable code-gen that assumes BMI1 support.) So even if you can assume BMI1 support (see footnote 1), it doesn't let you do anything fundamentally different.

The best you can do on x86 is the lock cmpxchg retry loop you propose in the question. This is a better option than finding the right bit in an old version of the variable and then using lock btr, especially if it would be a correctness problem to clear the wrong bit if a lower bit was set between the bit-scan and the lock btr. And you'd still need a retry loop in case another thread already cleared the bit you wanted.
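
For reference, here's roughly what that lock btr-style alternative looks like when written portably with fetch_and. This is only a sketch (the function name is mine, not from the question), and it's only usable if it's not a correctness problem to clear a bit that is no longer the lowest one, because a lower bit can get set between the load and the RMW:

#include <atomic>
#include <cstdint>

// Not the recommended approach; illustrates the caveats discussed above.
uint64_t try_pop_some_low_bit(std::atomic_uint64_t& foo)
{
    for (;;) {
        uint64_t v = foo.load(std::memory_order_relaxed);
        if (v == 0) return 0;                   // nothing to clear
        uint64_t bit = (-v) & v;                // lowest bit set *at load time*
        uint64_t old = foo.fetch_and(~bit, std::memory_order_seq_cst);
        if (old & bit) return bit;              // we were the one who cleared it
        // else another thread already cleared that bit: retry with a fresh load
    }
}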

CAS retry loops are not bad in practice: retry is quite rare

(Unless your program has very high contention for the shared variable, which would be problematic for performance even with lock add where there's no retrying, just hardware arbitration for access to cache lines. If that's the case, you should probably redesign your lockless algorithm and/or consider some kind of OS-assisted sleep/wake instead of having a lot of cores spending a lot of CPU time hammering on the same cache line. Lockless is great in the low-contention case.)

The window for the CPU to lose the cache line between the load to get expected and running lock cmpxchg with the result of a couple instructions on that value is tiny. Usually it will succeed the first time through, because the cache line will still be present in L1d cache when the cmpxchg runs. When the cache line arrives, it will (hopefully) already be in MESI Exclusive state, if the CPU saw far enough ahead to do an RFO for it.

You can instrument your cmpxchg retry loops to see how much contention you actually get in your real program. (e.g. by incrementing a local inside the loop and using an atomic relaxed += into a shared counter once you succeed, or with thread-local counters.)
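
For example (the counter name and publishing strategy are illustrative, not from the question):

#include <atomic>
#include <cstdint>

std::atomic_uint64_t g_cas_retries{0};   // hypothetical shared statistics counter

uint64_t pop_lowest_non_zero_instrumented(std::atomic_uint64_t& foo)
{
    unsigned retries = 0;
    auto expected = foo.load(std::memory_order_relaxed);
    while (!foo.compare_exchange_weak(expected, (expected - 1) & expected,
                                      std::memory_order_seq_cst,
                                      std::memory_order_relaxed))
        ++retries;                       // count locally: no extra sharing inside the loop

    if (retries)                         // publish only after success, and only if non-zero
        g_cas_retries.fetch_add(retries, std::memory_order_relaxed);
    return (-expected) & expected;       // isolate the lowest set bit of the old value
}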

Remember that your real code (hopefully) does plenty of work between atomic operations on this bitmask, so it's very different from a microbenchmark where all the threads spend all their time hammering on that cache line.

EDIT: The update has to be atomic and global progress must be guaranteed, but just as with the solution above, it doesn't have to be a wait-free algorithm (of course I'd be very happy if you can show me one).

A CAS retry loop (even when compiled on a LL/SC machine, see below) is lock-free in the technical sense: you only have to retry if some other thread made progress.

CAS leaves the cache line unmodified if it fails. On x86 it dirties the cache line (MESI M state), but x86 cmpxchg doesn't detect ABA, it only compares, so one other thread that loaded the same expected will succeed. On LL/SC machines, hopefully an SC failure on one core doesn't cause spurious SC failures on other cores, otherwise it could livelock. That's something you can assume CPU architects thought of.

It's not wait-free because a single thread can in theory have to retry an unbounded number of times.


Your code compiles with gcc8.1 -O3 -march=haswell to this asm (from the Godbolt compiler explorer)

# gcc's code-gen for x86 with BMI1 looks optimal to me.  No wasted instructions!
# presumably you'll get something similar when inlining.
pop_lowest_non_zero(std::atomic<unsigned long>&):
    mov     rax, QWORD PTR [rdi]
.L2:
    blsr    rdx, rax                      # Bit Lowest Set Reset
    lock cmpxchg    QWORD PTR [rdi], rdx
    jne     .L2                           # fall through on success: cmpxchg sets ZF on equal like regular cmp
    blsi    rax, rax                      # Bit Lowest Set Isolate
    ret

Without BMI1, blsr and blsi become two instructions each. This is close to irrelevant for performance given the cost of a locked instruction.

clang and MSVC unfortunately are slightly more clunky, with a taken branch on the no-contention fast path. (And clang bloats the function by peeling the first iteration. IDK if this helps with branch prediction or something, or if it's purely a missed optimization. We can get clang to generate the fast path with no taken branches using an unlikely() macro; see "How do the likely/unlikely macros in the Linux kernel work and what is their benefit?".)


Footnote 1:

Unless you're building binaries for a known set of machines, you can't assume BMI1 is available anyway. So the compiler will need to do something like lea rdx, [rax-1] / and rdx, rax instead of blsr rdx, rax. This makes a trivial difference for this function; the majority of the cost is the atomic operation, even in the uncontended, hot-in-L1d-cache case, looking at the out-of-order execution throughput impact on surrounding code. (e.g. 10 uops for lock cmpxchg on Skylake (http://agner.org/optimize/) vs. saving 1 uop by using blsr instead of 2 other instructions, assuming the front-end is the bottleneck rather than memory or something else. Being a full memory barrier also has an impact on loads/stores in surrounding code, but fortunately not on out-of-order execution of independent ALU instructions.)


Non-x86: LL/SC machines

Most non-x86 machines do all their atomic operations with load-linked / store-conditional retry loops. It's a bit unfortunate that C++11 doesn't allow you to create custom LL/SC operations (e.g. with (x-1) & x inside an LL/SC instead of just the add that you'd get from using fetch_add), but CAS machines (like x86) can't give you the ABA detection that LL/SC provides. So it's not clear how you'd design a C++ class that could compile efficiently on x86 but also directly to a LL/SC retry loop on ARM and other LL/SC ISAs. (See this discussion.)

So you just have to write your own CAS retry loop with compare_exchange_weak when std::atomic doesn't provide the read-modify-write operation you want; it does have fetch_or and fetch_and, but nothing that clears the lowest set bit.
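
One way to package that up is a small generic helper in the spirit of the loop from the question. The name atomic_fetch_update is made up here (it's not a standard C++ API), but the pattern is the same CAS retry loop:

#include <atomic>

template <class T, class F>
T atomic_fetch_update(std::atomic<T>& a, F f,
                      std::memory_order success = std::memory_order_seq_cst,
                      std::memory_order failure = std::memory_order_relaxed)
{
    T expected = a.load(std::memory_order_relaxed);
    // On failure, compare_exchange_weak reloads `expected`, so we just recompute and retry.
    while (!a.compare_exchange_weak(expected, f(expected), success, failure)) {}
    return expected;   // old value, so the caller can see which bits changed
}

// e.g. the operation from this question:
//   uint64_t old = atomic_fetch_update(foo, [](uint64_t v){ return (v - 1) & v; });
//   uint64_t bit = (-old) & old;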

What you get in practice from current compilers is a compare_exchange_weak implemented with LL/SC, and your arithmetic operation separate from that. The asm actually does the relaxed load before the exclusive-load-acquire (ldaxr), instead of just basing the computation on the ldaxr result. And there's extra branching to check that expected from the last attempt matches the new load result before attempting the store.

The LL/SC window is maybe shorter than with 2 dependent ALU instructions between the load and store, in case that matters. The CPU has the desired value ready ahead of time, not dependent on the result of the ldaxr; it only depends on the earlier relaxed load of expected. Clang's code-gen puts the sub/and inside the loop, but dependent on the previous iteration's load, so with out-of-order execution they can still be ready ahead of time.

But if that was really the most efficient way to do things, compilers should be using it for fetch_add as well. They don't, because it probably isn't. LL/SC retries are rare, just like CAS retries on x86. I don't know if it could make a difference for very-high-contention situations. Maybe, but compilers don't slow down the fast path to optimize for that when compiling fetch_add.

I renamed your functions and re-formatted the while() for readability, because it was already too long for one line without wrapping it with unlikely().

This version compiles to maybe slightly nicer asm than your original, with clang. I also fixed your function names so it actually compiles at all; there's a mismatch in your question. I picked totally different names that are similar to x86's BMI instruction names because they succinctly describe the operation.

#include <atomic>
#include <cstdint>

#ifdef __GNUC__
#define unlikely(expr) __builtin_expect(!!(expr), 0)
#define likely(expr) __builtin_expect(!!(expr), 1)
#else
#define unlikely(expr) (expr)
#define likely(expr) (expr)
#endif

inline uint64_t clear_lowest_set(std::uint64_t v){
    return (v - 1) & v;
}
inline uint64_t isolate_lowest_set(std::uint64_t v){
    //return v & ~clear_lowest_set(v);
    return (-v) & v;
    // MSVC optimizes this better for ARM when used separately.
    // The other way (in terms of clear_lowest_set) still CSEs to 2 instructions
    // when the clear_lowest_set result is already available.
}

uint64_t pop_lowest_non_zero(std::atomic_uint64_t& foo)
{
    auto expected = foo.load(std::memory_order_relaxed);
    while(unlikely(!foo.compare_exchange_weak(
        expected, clear_lowest_set(expected), 
        std::memory_order_seq_cst, std::memory_order_relaxed)))
        {}

    return isolate_lowest_set(expected);
}
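
A hypothetical usage sketch (not from the question): since clearing the lowest set bit of 0 is a no-op and isolate_lowest_set(0) == 0, pop_lowest_non_zero returns 0 on an empty mask, so the return value works directly as a loop condition. __builtin_ctzll is GNU-specific; MSVC would need _BitScanForward64 instead.

#include <cstdio>

void drain(std::atomic_uint64_t& mask)
{
    while (uint64_t bit = pop_lowest_non_zero(mask)) {
        unsigned idx = __builtin_ctzll(bit);    // bit index 0..63 (GNU builtin)
        std::printf("claimed slot %u\n", idx);  // stand-in for real per-slot work
    }
}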

Clang -O3 for AArch64 (-target aarch64-linux-android -stdlib=libc++ -mcpu=cortex-a57 on Godbolt) makes some weird code. This is from clang6.0, but there is weirdness with older versions, too, creating a 0/1 integer in a register and then testing it instead of just jumping to the right place in the first place.

pop_lowest_non_zero(std::__1::atomic<unsigned long long>&): // @pop_lowest_non_zero(std::__1::atomic<unsigned long long>&)

    ldr     x9, [x0]                   @ the relaxed load
    ldaxr   x8, [x0]                   @ the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                   @ compare part of the CAS
    b.ne    .LBB0_3
    sub     x10, x9, #1
    and     x10, x10, x9             @ clear_lowest( relaxed load result)
    stlxr   w11, x10, [x0]           @ the CAS store (sequential-release)
    cbnz    w11, .LBB0_4             @ if(SC failed) goto retry loop

          # else fall through and eventually reach the epilogue

    # this looks insane.  w10 = 0|1, then branch if bit[0] != 0.  Always taken, right?
    orr     w10, wzr, #0x1
    tbnz    w10, #0, .LBB0_5         @ test-bit not-zero will always jump to the epilogue
        b       .LBB0_6              # never reached


.LBB0_3:
    clrex                           @ clear the ldrx/stxr transaction
.LBB0_4:
    # This block is pointless; just should b to .LBB0_6
    mov     w10, wzr
    tbz     w10, #0, .LBB0_6    # go to the retry loop, below the ret (not shown here)

.LBB0_5:               # isolate_lowest_set and return
    neg     x8, x9
    and     x0, x9, x8
    ret

.LBB0_6:
     @ the retry loop.  Only reached if the compare or SC failed.
     ...

All this crazy branching might not be a real performance problem, but it's obvious this could be a lot more efficient (even without teaching clang how to use LL/SC directly instead of emulated CAS). Reported as clang/LLVM bug 38173.

It seems MSVC doesn't know that AArch64's release-stores already have sequentially-consistent semantics with respect to ldar / ldaxr loads (not just regular release like x86): it's still using a dmb ish instruction (full memory barrier) after stlxr. It has fewer wasted instructions, but its x86 bias is showing or something: compare_exchange_weak compiles like compare_exchange_strong, with a probably-useless retry loop that will try just the CAS again with the old expected/desired on LL/SC failure. That will usually be because another thread modified the line, so expected will mismatch. (Godbolt doesn't have MSVC for AArch64 in any older versions, so perhaps support is brand new. That might explain the dmb.)

   ## MSVC Pre 2018 -Ox
|pop_lowest_non_zero| PROC
    ldr         x10,[x0]          # x10 = expected = foo.load(relaxed)

|$LL2@pop_lowest|           @ L2  # top of the while() loop
    sub         x8,x10,#1
    and         x11,x8,x10        # clear_lowest(relaxed load result)
|$LN76@pop_lowest|          @ LN76
    ldaxr       x8,[x0]
    cmp         x8,x10            # the compare part of CAS
    bne         |$LN77@pop_lowest|
    stlxr       w9,x11,[x0]
    cbnz        w9,|$LN76@pop_lowest|  # retry just the CAS on LL/SC fail; this looks like compare_exchange_strong
     # fall through on LL/SC success

|$LN77@pop_lowest|          @ LN77
    dmb         ish                # full memory barrier: slow
    cmp         x8,x10             # compare again, because apparently MSVC wants to share the `dmb` instruction between the CAS-fail and CAS-success paths.
    beq         |$LN75@pop_lowest| # if(expected matches) goto epilogue
    mov         x10,x8             # else update expected
    b           |$LL2@pop_lowest|  # and goto the top of the while loop

|$LN75@pop_lowest|          @ LN75   # function epilogue
    neg         x8,x10
    and         x0,x8,x10
    ret

gcc6.3 for AArch64 makes weird code, too, storing expected to the stack. (Godbolt doesn't have newer AArch64 gcc).


Hand-written AArch64 version that sucks less

I'm very unimpressed with AArch64 compilers for this! IDK why they don't generate something clean and efficient like this, with no taken branches on the fast path, and only a bit of out-of-line code to set up for jumping back into the CAS to retry.

pop_lowest_non_zero ## hand written based on clang
                    # clang could emit this even without turning CAS into LL/SC directly

    ldr     x9, [x0]                   @ x9 = expected = foo.load(relaxed)
 .Lcas_retry:
    ldaxr   x8, [x0]                   @ x8 = the CAS load (a=acquire, x=exclusive: pairs with a later stxr)
    cmp     x8, x9                   @ compare part of the CAS
    b.ne    .Lcas_fail
    sub     x10, x9, #1
    and     x10, x10, x9             @ clear_lowest (relaxed load result)
    stlxr   w11, x10, [x0]           @ the CAS store (sequential-release)
    cbnz    w11, .Lllsc_fail

    # LL/SC success: isolate_lowest_set and return
.Lepilogue:
    neg     x8, x9
    and     x0, x9, x8
    ret

.Lcas_fail:
    // clrex        @ We're about to start another ldaxr making clrex unnecessary
.Lllsc_fail:
    mov     x9, x8                  @ update expected
    b     .Lcas_retry           @ instead of duplicating the loop, jump back to the existing one with x9 = expected

Compare with .fetch_add:

Clang does make nice code for fetch_add():

    mov     x8, x0
.LBB1_1:
    ldxr    x0, [x8]                # relaxed exclusive load: I used mo_release
    add     x9, x0, #1
    stlxr   w10, x9, [x8]
    cbnz    w10, .LBB1_1            # retry if LL/SC failed
    ret

Instead of add x9, x0, #1, we'd like to have sub x9, x0, #1 / and x9, x9, x0, so we just get an LL/SC retry loop. This saves code-size, but it won't be significantly faster. Probably not even measurably faster in most cases, especially as part of a whole program where it's not used an insane amount.


Alternatives: Do you even need exactly this bitmap operation? Can you break it up to reduce contention?

Can you use an atomic counter instead of a bitmap, and map it to a bitmap when needed? Operations that want a bitmap can map the counter to a bitmap with uint64_t(~0ULL) << (64-counter) (for non-zero counter only). Or maybe tmp=1ULL << counter; tmp ^= tmp-1; (i.e. x86 xor eax,eax / bts rax, rdi (rax = 1 bit set at position 0..63) / blsmsk rax, rax (rax = all bits set up to and including that position)). Hmm, that still needs a special case for mask=0, because there are 65 possible states for a contiguous bitmask: highest (or lowest) bit at one of 64 positions, or no bits set at all.
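
A minimal sketch of that mapping (the low-bits variant, with the mask == 0 state handled explicitly; the function name is mine):

#include <cstdint>

inline uint64_t low_mask_from_counter(unsigned counter)   // counter in 0..64
{
    if (counter == 0)  return 0;          // 65th state: no bits set
    if (counter >= 64) return ~0ULL;      // all 64 bits (1ULL << 64 would be UB)
    return (1ULL << counter) - 1;         // counter contiguous low bits
}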

Or if there's some pattern to the bitmap, x86 BMI2 pdep can scatter contiguous bits into that pattern. Get N contiguous set bits with (1ULL << counter) - 1, again for counter < 64.
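
A hedged sketch of that pdep idea (needs BMI2, e.g. compile with -mbmi2 or -march=haswell; the function name and the counter < 64 assumption are mine):

#include <immintrin.h>
#include <cstdint>

inline uint64_t first_n_set_bits_of_pattern(uint64_t pattern, unsigned counter)
{
    // (1ULL << counter) - 1 is counter contiguous low bits (counter < 64);
    // _pdep_u64 scatters them into the counter lowest set positions of `pattern`.
    return _pdep_u64((1ULL << counter) - 1, pattern);
}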


Or maybe it has to be a bitmask, but doesn't need to be one single bitmask?

No idea what your use-case is, but this kind of idea might be useful:

Do you need a single atomic bitmap that all threads have to contend for? Perhaps you could break it into multiple chunks, each in a separate cache line. (But that makes it impossible to atomically snapshot the entire map.) Still, if each thread has a preferred chunk, and always tries that before moving on to look for a slot in another chunk, you might reduce contention in the average case.
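
A minimal sketch of that chunking idea, under assumed names and an arbitrary chunk count (nothing here is from the question; it's just one way to lay it out):

#include <atomic>
#include <cstdint>
#include <cstddef>
#include <utility>

struct ShardedBitmap {
    static constexpr std::size_t kChunks = 4;     // illustrative chunk count
    struct alignas(64) Chunk {                    // one chunk per cache line
        std::atomic_uint64_t bits{0};
    };
    Chunk chunks[kChunks];

    // Try the caller's preferred chunk first, then the others.  Returns the
    // isolated bit and the chunk index, or {0, 0} if every chunk looked empty.
    std::pair<std::uint64_t, std::size_t> pop_lowest(std::size_t preferred)
    {
        for (std::size_t i = 0; i < kChunks; ++i) {
            std::size_t idx = (preferred + i) % kChunks;
            std::atomic_uint64_t& a = chunks[idx].bits;
            std::uint64_t expected = a.load(std::memory_order_relaxed);
            while (expected != 0) {
                if (a.compare_exchange_weak(expected, (expected - 1) & expected,
                                            std::memory_order_seq_cst,
                                            std::memory_order_relaxed))
                    return { (-expected) & expected, idx };
                // on failure, expected holds the fresh value; loop if still non-zero
            }
        }
        return { 0, 0 };
    }
};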

In asm (or with horrible union hacks in C++), you could try to reduce cmpxchg failures without reducing contention by finding the right byte of the 64-bit integer to update, and then only lock cmpxchg on it. That's not actually helpful for this case because two threads that see the same integer will both try to clear the same bit. But it could reduce retries between this and something that sets other bits in the qword. Of course this only works if it's not a correctness problem for lock cmpxchg to succeed when other bytes of the qword have changed.

answered Sep 29 '22 by Peter Cordes