Why isn't CAS (Compare And Swap) equivalent to busy wait loops?

Reading a bit about lock-free programming over the past few days, I came across the java.util.Random class creating its bits using the following routine:

protected int next(int bits) {
    long oldseed, nextseed;
    AtomicLong seed = this.seed;
    do {
        oldseed = seed.get();
        nextseed = (oldseed * multiplier + addend) & mask;
    } while (!seed.compareAndSet(oldseed, nextseed));
    return (int)(nextseed >>> (48 - bits));
}

According to this answer to the post "Spinlock vs Busy wait":

So-called lock-free algorithms tend to use tight busy-waiting with a CAS instruction, but the contention is in ordinary situations so low that the CPU usually have to iterate only a few times.

And the Wikipedia article "Compare-and-swap":

Instead of immediately retrying after a CAS operation fails, researchers have found that total system performance can be improved in multiprocessor systems—where many threads constantly update some particular shared variable—if threads that see their CAS fail use exponential backoff—in other words, wait a little before retrying the CAS.[4]

Should the Wikipedia article be understood as saying this has been found out but isn't employed yet, or is it common practice for CAS loops to artificially back off after they have failed? Is this the reason such a loop isn't considered dangerous regarding CPU usage, or is it because the CAS isn't constantly contended?

Second question: is there any specific reason a local reference to seed is created, or could we simply use the field from the class scope directly?

Asked Sep 14 '18 by Kilian

1 Answer

Multiple threads trying to CAS something is lock-free (but not wait-free). One of the threads will make progress every time they all try with the same old value. https://en.wikipedia.org/wiki/Non-blocking_algorithm.

(Whether multiple threads all read the same old value or whether some see the result of another thread's CAS depends on timing, and is basically what determines how much contention there is.)
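To make that concrete, here's a minimal CAS-retry loop in Java, in the same shape as the `Random.next` code from the question. It builds an "atomic max" out of `AtomicInteger.compareAndSet` (the class and method names here are my own invention, just a sketch):

```java
import java.util.concurrent.atomic.AtomicInteger;

public class AtomicMax {
    // Lock-free "update to max" built from a CAS retry loop.
    // If several threads race on the same old value, at least one CAS
    // succeeds per round, so the system as a whole always makes progress
    // (lock-free), even though an individual thread may retry (not wait-free).
    public static int updateMax(AtomicInteger target, int candidate) {
        int cur;
        do {
            cur = target.get();
            if (candidate <= cur)
                return cur;              // nothing to do, no CAS needed
        } while (!target.compareAndSet(cur, candidate));
        return candidate;
    }

    public static void main(String[] args) {
        AtomicInteger max = new AtomicInteger(0);
        updateMax(max, 7);
        updateMax(max, 3);
        System.out.println(max.get()); // prints 7
    }
}
```

A failed CAS here means another thread already installed a newer value, so the retry re-reads and recomputes; nobody is ever stuck waiting on a descheduled lock-holder.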

This is unlike a normal busy-wait loop which is just waiting for some unknown-length operation and could be stuck indefinitely if the thread holding a lock is descheduled. In that case, you definitely want to back off if your CAS fails to acquire a lock, because you definitely have to wait for another thread to do something before you can succeed.


Usually lockless algorithms are used in low-contention situations where sophisticated exponential backoff isn't really needed. That's what the linked SO answer is saying.

This is a key difference from the situation mentioned in the Wiki article: where many threads constantly update some particular shared variable. That's a high-contention situation, so it's probably better to let one thread do a bunch of updates in a row and keep the line hot in their L1d cache. (Assuming you're using CAS to implement an atomic operation that the hardware doesn't support directly, like an atomic double-precision FP add, where you do shared.CAS(old, old+1.0) or something. Or as part of a lockless queue or something.)
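The atomic-FP-add case looks like this in Java: store the double's raw bits in an AtomicLong and CAS the bits. This is a sketch with names of my own choosing, but I believe Guava's AtomicDouble works essentially this way:

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicDouble {
    // Atomic double-precision add built from CAS, since the hardware
    // has no native FP atomic add. The double lives as raw bits in
    // an AtomicLong; we recompute and CAS until no other thread has
    // changed the bits between our read and our CAS.
    private final AtomicLong bits = new AtomicLong(Double.doubleToRawLongBits(0.0));

    public double addAndGet(double delta) {
        long old, next;
        double result;
        do {
            old = bits.get();
            result = Double.longBitsToDouble(old) + delta;
            next = Double.doubleToRawLongBits(result);
        } while (!bits.compareAndSet(old, next));
        return result;
    }

    public double get() {
        return Double.longBitsToDouble(bits.get());
    }
}
```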

If you were using a CAS loop that was highly contended in practice, it might help total throughput some to back off and e.g. run an x86 pause instruction on failure before trying again, to have fewer cores hammering on a cache line. Or for a lockless queue, if you find it was full or empty, then that's basically a waiting-for-another-thread situation so you should definitely back off.
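In Java, `Thread.onSpinWait()` (Java 9+) is the portable way to emit that pause-style hint; on x86 it can compile down to the `pause` instruction. A sketch of a CAS loop with bounded exponential backoff on failure (class/constant names are mine):

```java
import java.util.concurrent.atomic.AtomicLong;

public class BackoffCas {
    // CAS retry loop with bounded exponential backoff: after each failed
    // CAS, spin-wait 1, 2, 4, ... up to MAX_SPINS iterations of
    // Thread.onSpinWait() so fewer cores hammer the contended cache line.
    private static final int MAX_SPINS = 64;

    public static long getAndDouble(AtomicLong shared) {
        int spins = 1;
        for (;;) {
            long old = shared.get();
            if (shared.compareAndSet(old, old << 1))
                return old;
            for (int i = 0; i < spins; i++)
                Thread.onSpinWait();           // pause-style CPU hint
            if (spins < MAX_SPINS)
                spins <<= 1;                   // exponential backoff, capped
        }
    }
}
```

Under low contention the backoff loop never runs, so this costs nothing in the common case.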


Most architectures other than x86 have LL/SC as their atomic RMW primitive, not a direct hardware CAS. A CAS built out of LL/SC can spuriously fail if other threads are even just reading the cache line during the attempt, so it might not be guaranteed that at least one thread succeeds.

Hopefully hardware designers try to make LL/SC resist spurious failure from contention, but I don't know the details. In that case, backoff might help to avoid potential livelock.

(On hardware where CAS can't spuriously fail from contention, livelock is impossible for something like while(!shared.CAS(old, old<<1)){}.)


Intel's optimization manual has an example of waiting for a lock to become free, where they loop 1 << retry_count times (up to some max backoff factor). Note that this is not a normal CAS loop that's part of a lockless algorithm; it's for implementing a lock.

The backoff is waiting for the lock to become free, not just for contention for access to the cache line containing the lock itself.

// Intel's optimization manual
// Example 2-2. Contended Locks with Increasing Back-off Example
// From section 2.2.4 Pause Latency in Skylake Microarchitecture:
// pause is ~140 cycles, up from ~10 in Broadwell, so max backoff should be shorter.
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
   while (lock == busy)
     _mm_pause();
}


/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
   while (lock == busy)
   {
      for (int i=mask; i; --i){
         _mm_pause();
      }
      mask = mask < max ? mask<<1 : max;    // mask <<= 1  up to a max
   }
}

I thought that normally when waiting for a lock, you want to spin read-only instead of retrying with cmpxchg over and over. I think this example from Intel is only demonstrating backoff, not the other parts of how to optimize a lock to avoid delaying the unlocking thread.
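A read-only spin ("test and test-and-set") with the same kind of pause hint might look like this in Java. This is a sketch to illustrate the idea, not a production lock (you'd normally use ReentrantLock or synchronized):

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class TtasLock {
    // Test-and-test-and-set spin lock: spin READ-ONLY while the lock
    // looks busy, and only attempt the CAS once it looks free. The
    // read-only spin keeps the cache line in shared state instead of
    // hammering it with ownership requests, which would delay the
    // unlocking thread's store.
    private final AtomicBoolean locked = new AtomicBoolean(false);

    public void lock() {
        for (;;) {
            while (locked.get())        // read-only spin while busy
                Thread.onSpinWait();    // pause-style CPU hint
            if (locked.compareAndSet(false, true))
                return;                 // CAS only when it looked free
        }
    }

    public void unlock() {
        locked.set(false);
    }
}
```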

Anyway, remember that Intel's example is unlike what we're talking about with a lockless queue or a CAS-retry implementation of an atomic add or other primitive. It's waiting for another thread to release a lock, not just retrying because a new value appeared between reading the old value and attempting to CAS in a new one.

Answered Nov 15 '22 by Peter Cordes