For the purposes of this question, assume a simple spinlock that does not fall back to OS waiting.

I see that simple spinlocks are often implemented using lock xchg or lock bts instead of lock cmpxchg.

But doesn't cmpxchg avoid writing the value if the expectation does not match? So aren't failed attempts cheaper with cmpxchg? Or does cmpxchg write data and invalidate the cache line of other cores even on failure?

This question is similar to "What specifically marks an x86 cache line as dirty - any write, or is an explicit change required?", but it is specific to cmpxchg, not to writes in general.
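For concreteness, here is a minimal sketch (my own, not taken from any particular implementation) of the two acquire loops being compared, using GCC/Clang __atomic builtins on an int where 0 means free and 1 means busy:

// Variant 1: unconditional atomic swap (compiles to lock xchg).
static inline void spin_lock_xchg(volatile int *lock)
{
    while (__atomic_exchange_n(lock, 1, __ATOMIC_ACQUIRE))
        ;  // keep swapping until the previous value was 0 (free)
}

// Variant 2: compare-and-swap (compiles to lock cmpxchg).
static inline void spin_lock_cmpxchg(volatile int *lock)
{
    int expected = 0;
    while (!__atomic_compare_exchange_n(lock, &expected, 1, 0,
                                        __ATOMIC_ACQUIRE, __ATOMIC_RELAXED))
        expected = 0;  // CAS failed: lock was busy; reset expectation and retry
}

The question is whether a failed iteration of the second loop is cheaper for other cores than a failed iteration of the first.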
On most or all current Intel x86 processors, a lock cmpxchg to a location whose memory type is WB and is fully contained within a single L1D cache line is executed as follows:

- A lock-read request is issued, which brings the target line into the L1D in an exclusive, cache-locked state.
- The comparison is performed and, if it succeeds, the destination is overwritten with the new value.
- The line is unlocked and left in the modified (M, dirty) state, whether or not the comparison succeeded.
The first and last steps can be observed empirically using either certain performance events or latency-based measurements. One way would be to allocate a large array of atomic variables and then execute lock cmpxchg in a loop over that array. The lock-read request type is one of the types of RFO requests. So the L2_TRANS.RFO event (or what's equivalent), which is reliable on most microarchitectures, can be used to measure the number of lock-reads to the L2. (L2_TRANS.RFO counts demand RFOs, so it's better to turn off the hardware prefetchers to avoid unwanted hits in the L2. This also applies to L2_RQSTS.RFO_*.)
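A minimal sketch of such a measurement loop (my own illustration; the array size and stride are assumptions, and reading the events with perf or similar is not shown):

#include <stdint.h>
#include <stdlib.h>

// Touch one cache line per iteration with lock cmpxchg, so each iteration
// should generate one lock-read (RFO) request.
#define LINE  64
#define LINES (1 << 20)   // ~64 MiB of lines, much larger than the L2

int main(void)
{
    uint64_t *buf = aligned_alloc(LINE, (size_t)LINES * LINE);
    if (!buf) return 1;
    for (size_t i = 0; i < (size_t)LINES * (LINE / sizeof(uint64_t));
         i += LINE / sizeof(uint64_t)) {
        uint64_t expected = 0;   // the comparison may succeed or fail;
                                 // the point is what happens to the line either way
        __atomic_compare_exchange_n(&buf[i], &expected, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }
    return 0;
}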
There are also events for measuring the number of writebacks, such as L2_TRANS.L1D_WB, L2_TRANS.L2_WB, and others. Unfortunately, on many microarchitectures these events either undercount, overcount, or count accurately but not necessarily all/only dirty cache line writebacks. So they are more difficult to reason with and in general not reliable.
A better way would be to execute lock cmpxchg on one section of the array on a particular physical core, then migrate the thread to another physical core (in the same L3 sharing domain) and execute a loop in which the elements of that section are read (normal reads). If the lock cmpxchg instruction puts the target line in the M state, a read request from another physical core in the same L3 sharing domain should hit in the L3 and also hit-modified in the private caches of the core on which lock cmpxchg was executed. These events can be counted using OFFCORE_RESPONSE.DEMAND_DATA_RD.L3_HIT.HITM_OTHER_CORE (or what's equivalent), which is reliable on most/all microarchitectures.
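A rough sketch of the migration part of that experiment (my own illustration, assuming Linux sched_setaffinity and that CPU 0 and CPU 1 are distinct physical cores sharing an L3; programming the offcore-response counters is not shown):

#define _GNU_SOURCE
#include <sched.h>
#include <stddef.h>
#include <stdint.h>

// Pin the calling thread to one logical CPU.
static void pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);   // 0 = calling thread
}

// Dirty a section with lock cmpxchg on core A, then read it back from core B.
uint64_t experiment(uint64_t *section, size_t n_lines)
{
    pin_to_cpu(0);                             // core A
    for (size_t i = 0; i < n_lines; i++) {
        uint64_t expected = 0;
        __atomic_compare_exchange_n(&section[i * 8], &expected, 1, 0,
                                    __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST);
    }

    pin_to_cpu(1);                             // core B, same L3 sharing domain
    uint64_t sum = 0;
    for (size_t i = 0; i < n_lines; i++)       // plain reads: should hit-modified
        sum += __atomic_load_n(&section[i * 8], __ATOMIC_RELAXED);
    return sum;
}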
A locked instruction is an expensive operation for three reasons:
(1) It requires bringing the line into an exclusive state.
(2) It makes the line dirty (possibly unnecessarily), and too many writebacks can have a significant impact on execution time, even more so when they end up stealing main memory bandwidth from long stretches of read requests, and even more so when the writes are to persistent memory.
(3) It is architecturally serializing, which puts the instruction on the critical path.
Intel has a patent that proposes an optimization for the last one, where the core optimistically assumes that there is no lock contention and issues a speculative normal load to the target line. If the line is not present in any other physical core, the line will be in an exclusive state in the requesting core. Then, when the locked instruction executes and issues the lock-read request, the line would hopefully still be in the exclusive state, in which case the total latency of the locked instruction would be reduced. I don't know if any processor implements this optimization. If it's implemented, the number of L2_TRANS.RFO events would be much smaller than the number of lines locked.
I made some tests. They were very synthetic though: I did very little work under the lock and measured the throughput of a heavily contended scenario.
So far, no consistent difference between lock bts, xchg, or lock cmpxchg was observed.
Other stuff, however, had some effect:

- a load loop (spinning on plain loads before retrying the locked operation) is definitely helpful, both with and without pause
- pause in the loop is helpful, both with and without a load loop
- exponentially growing pause back-off (as in the improved version below) helps in some cases, but degrades performance of tests with pause
The Intel® 64 and IA-32 Architectures Optimization Reference Manual recommends using pause.

Example 2-4, "Contended Locks with Increasing Back-off Example", shows a baseline version:
/*******************/
/*Baseline Version */
/*******************/
// atomic {if (lock == free) then change lock state to busy}
while (cmpxchg(lock, free, busy) == fail)
{
while (lock == busy)
{
__asm__ ("pause");
}
}
and an improved version:
/*******************/
/*Improved Version */
/*******************/
int mask = 1;
int const max = 64; //MAX_BACKOFF
while (cmpxchg(lock, free, busy) == fail)
{
while (lock == busy)
{
for (int i=mask; i; --i){
__asm__ ("pause");
}
mask = mask < max ? mask<<1 : max;
}
}
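The manual's cmpxchg(lock, free, busy) is pseudocode; a compilable rendering of the improved version could look roughly like this (my own sketch using C11 atomics, not the manual's code):

#include <stdatomic.h>
#include <immintrin.h>   // _mm_pause

enum { FREE = 0, BUSY = 1 };

// Spinlock acquire with exponentially growing pause back-off.
void lock_with_backoff(atomic_int *lock)
{
    int mask = 1;
    const int max = 64;                          // MAX_BACKOFF
    int expected = FREE;
    while (!atomic_compare_exchange_strong(lock, &expected, BUSY)) {
        // Lock is busy: wait using plain loads (no locked operations),
        // pausing an exponentially growing number of times between checks.
        while (atomic_load_explicit(lock, memory_order_relaxed) == BUSY) {
            for (int i = mask; i; --i)
                _mm_pause();
            mask = mask < max ? mask << 1 : max;
        }
        expected = FREE;                          // retry the CAS
    }
}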
Windows SRWLOCK may also be a good example to follow. It uses a load loop and pause, and it starts with an interlocked operation: lock bts to acquire exclusive, lock cmpxchg to acquire shared. Even TryAcquireSRWLockExclusive does only a lock bts:
RtlTryAcquireSRWLockExclusive:
00007FFA86D71370 lock bts qword ptr [rcx],0
00007FFA86D71376 setae al
00007FFA86D71379 ret
It doesn't, however, implement exponentially growing pause in the waiting versions. It does a small number of loads with one pause each, then goes to an OS wait.
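That general shape (not SRWLOCK's actual implementation, just a sketch of the pattern, assuming Windows 8+ where WaitOnAddress/WakeByAddressSingle are available) could look like:

#include <windows.h>
#include <intrin.h>
#pragma comment(lib, "synchronization.lib")   // WaitOnAddress / WakeByAddressSingle

// Bit 0 of the lock word means "held". Bounded spin, then OS wait.
void acquire_exclusive(volatile LONG *lock)
{
    for (;;) {
        if (!_interlockedbittestandset(lock, 0))   // lock bts
            return;                                // bit was clear: acquired

        // Bounded spin: plain loads with a pause, no locked operations.
        for (int i = 0; i < 64; ++i) {
            if ((*lock & 1) == 0)
                break;
            _mm_pause();
        }

        // Still busy: fall back to an OS wait until the lock word changes.
        LONG observed = *lock;
        if (observed & 1)
            WaitOnAddress(lock, &observed, sizeof(observed), INFINITE);
    }
}

void release_exclusive(volatile LONG *lock)
{
    _interlockedbittestandreset(lock, 0);          // clear the held bit
    WakeByAddressSingle((PVOID)lock);              // wake one waiter
}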