For the purposes of this question, assume a simple spinlock that does not fall back to OS waiting.

I see that simple spinlocks are often implemented using lock xchg or lock bts instead of lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg? Or does cmpxchg write data and invalidate the cache line of other cores even on failure?

This question is similar to "What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?", but it is specific to cmpxchg, not to writes in general.
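For concreteness, here is a minimal sketch (my own, not taken from any particular implementation) of the two acquire loops being compared, using GCC/Clang __atomic builtins on an int where 0 means free and 1 means busy:

// Variant 1: unconditional atomic swap (compiles to lock xchg).
static inline void spin_lock_xchg(volatile int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;  // keep swapping until the previous value was 0 (free)
}

// Variant 2: compare-and-swap (compiles to lock cmpxchg).
static inline void spin_lock_cmpxchg(volatile int *lock)
{
    int expected = 0;
    while (!__atomic_compare_exchange_n(lock, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        expected = 0;  // CAS failed: lock was busy; reset expectation and retry
}

The question is whether a failed iteration of the second loop is cheaper for other cores than a failed iteration of the first.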
On most or all current Intel x86 processors, a lock cmpxchg to a location whose memory type is WB and is fully contained within a single L1D cache line is executed as follows:

- A lock-read request is issued, which brings the target line into the L1D in an exclusive, cache-locked state.
- The comparison is performed and, if it succeeds, the destination is overwritten with the new value.
- The line is unlocked and left in the modified (M, dirty) state, whether or not the comparison succeeded.
The first and last steps can be observed empirically using either certain performance events or latency-based measurements. One way would be to allocate a large array of atomic variables and then execute lock cmpxchg in a loop over that array. The lock-read request type is one of the types of RFO requests. So the L2_TRANS.RFO event (or what's equivalent), which is reliable on most microarchitectures, can be used to measure the number of lock-reads to the L2. (L2_TRANS.RFO counts demand RFOs, so it's better to turn off the hardware prefetchers to avoid unwanted hits in the L2. This also applies to L2_RQSTS.RFO_*.)
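A minimal sketch of such a measurement loop (my own illustration; the array size and stride are assumptions, and reading the events with perf or similar is not shown):

#include <stdint.h>
#include <stdlib.h>

// Touch one cache line per iteration with lock cmpxchg, so each iteration
// should generate one lock-read (RFO) request.
#define LINE  64
#define LINES (1 << 20)   // ~64 MiB of lines, much larger than the L2

int main(void)
{
    uint64_t *buf = aligned_alloc(LINE, (size_t)LINES * LINE);
    if (!buf) return 1;
    for (size_t i = 0; i < (size_t)LINES * (LINE / sizeof(uint64_t));
         i += LINE / sizeof(uint64_t)) {
        uint64_t expected = 0;   // the comparison may succeed or fail;
                                 // the point is what happens to the line either way
        __atomic_compare_exchange_n(&buf[i], &expected, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }
    return 0;
}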
There are also events for measuring the number of writebacks, such as L2_TRANS.L1D_WB, L2_TRANS.L2_WB, and others. Unfortunately, on many microarchitectures these events either undercount, overcount, or count accurately but not necessarily all/only dirty cache line writebacks. So they are more difficult to reason with and in general not reliable.
A better way would be to execute lock cmpxchg on one section of the array on a particular physical core, then migrate the thread to another physical core (in the same L3 sharing domain) and execute a loop in which the elements of that section are read (normal reads). If the lock cmpxchg instruction puts the target line in the M state, a read request from another physical core in the same L3 sharing domain should hit in the L3 and also hit-modified in the private caches of the core on which lock cmpxchg was executed. These events can be counted using OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE (or what's equivalent), which is reliable on most/all microarchitectures.
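A rough sketch of the migration part of that experiment (my own illustration, assuming Linux sched_setaffinity and that CPU 0 and CPU 1 are distinct physical cores sharing an L3; programming the offcore-response counters is not shown):

#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

// Pin the calling thread to one logical CPU.
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // 0 = calling thread
}

// Dirty a section with lock cmpxchg on core A, then read it back from core B.
uint64_t experiment(uint64_t *section, size_t n_lines)
{
    pin_to_cpu(0);                             // core A
    for (size_t i = 0; i < n_lines; i++) {
        uint64_t expected = 0;
        __atomic_compare_exchange_n(&section[i * 8], &expected, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    pin_to_cpu(1);                             // core B, same L3 sharing domain
    uint64_t sum = 0;
    for (size_t i = 0; i < n_lines; i++)       // plain reads: should hit-modified
        sum += __atomic_load_n(&section[i * 8], __ATOMIC_RELAXED);
    return sum;
}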
A locked instruction is an expensive operation for three reasons:
(1) It requires bringing the line into an exclusive state.
(2) It makes the line dirty (possibly unnecessarily), and too many writebacks can have a significant impact on execution time, even more so when they end up stealing main memory bandwidth from long stretches of read requests, and even more so when the writes are to persistent memory.
(3) It is architecturally serializing, which puts the instruction on the critical path.
Intel has a patent that proposes an optimization for the last one, where the core optimistically assumes that there is no lock contention and issues a speculative normal load to the target line. If the line is not present in any other physical core, the line will be in an exclusive state in the requesting core. Then, when the locked instruction executes and issues the lock-read request, the line would hopefully still be in the exclusive state, in which case the total latency of the locked instruction would be reduced. I don't know if any processor implements this optimization. If it's implemented, the number of L2_TRANS.RFO events would be much smaller than the number of lines locked.
I made some tests. They were very synthetic though: I did very little work under the lock and measured the throughput of a heavily contended scenario.
So far, no consistent difference between lock bts, xchg, or lock cmpxchg was observed.
Other stuff, however, had some effect:

- a load loop (spinning on plain loads before retrying the locked operation) is definitely helpful, both with and without pause
- pause in the loop is helpful, both with and without a load loop
- exponentially growing pause back-off (as in the improved version below) helps in some cases, but degrades performance of tests with pause
The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.

Example 2-4, "Contended Locks with Increasing Back-off Example", shows a baseline version:
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
while (lock == busy)
{
__asm__ ("pause");
}
}
and an improved version:
/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
while (lock == busy)
{
for (int i=mask; i; --i){
__asm__ ("pause");
}
mask = mask < max ? mask<<1 : max;
}
}
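The manual's cmpxchg(lock, free, busy) is pseudocode; a compilable rendering of the improved version could look roughly like this (my own sketch using C11 atomics, not the manual's code):

#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause

enum { FREE = 0, BUSY = 1 };

// Spinlock acquire with exponentially growing pause back-off.
void lock_with_backoff(atomic_int *lock)
{
    int mask = 1;
    const int max = 64;                          // MAX_BACKOFF
    int expected = FREE;
    while (!atomic_compare_exchange_strong(lock, &expected, BUSY)) {
        // Lock is busy: wait using plain loads (no locked operations),
        // pausing an exponentially growing number of times between checks.
        while (atomic_load_explicit(lock, memory_order_relaxed) == BUSY) {
            for (int i = mask; i; --i)
                _mm_pause();
            mask = mask < max ? mask << 1 : max;
        }
        expected = FREE;                          // retry the CAS
    }
}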
Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause, and it starts with an interlocked operation: lock bts to acquire exclusive, lock cmpxchg to acquire shared. Even TryAcquireSRWLockExclusive does only a lock bts:
RtlTryAcquireSRWLockExclusive:
00007FFA86D71370 lock bts qword ptr [rcx],0
00007FFA86D71376 setae al
00007FFA86D71379 ret
It doesn't, however, implement exponentially growing pause in the waiting versions. It does a small number of loads with one pause each, then goes to an OS wait.
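That general shape (not SRWLOCK's actual implementation, just a sketch of the pattern, assuming Windows 8+ where WaitOnAddress/WakeByAddressSingle are available) could look like:

#include <windows.h>
#include <intrin.h>
#pragma comment(lib, "synchronization.lib")   // WaitOnAddress / WakeByAddressSingle

// Bit 0 of the lock word means "held". Bounded spin, then OS wait.
void acquire_exclusive(volatile LONG *lock)
{
    for (;;) {
        if (!_interlockedbittestandset(lock, 0))   // lock bts
            return;                                // bit was clear: acquired

        // Bounded spin: plain loads with a pause, no locked operations.
        for (int i = 0; i < 64; ++i) {
            if ((*lock & 1) == 0)
                break;
            _mm_pause();
        }

        // Still busy: fall back to an OS wait until the lock word changes.
        LONG observed = *lock;
        if (observed & 1)
            WaitOnAddress(lock, &observed, sizeof(observed), INFINITE);
    }
}

void release_exclusive(volatile LONG *lock)
{
    _interlockedbittestandreset(lock, 0);          // clear the held bit
    WakeByAddressSingle((PVOID)lock);              // wake one waiter
}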