I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32-core Skylake Intel processor. Unsurprisingly (given the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, std::atomic::fetch_add is implemented with lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).

What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness in both instructions comes from contention on the cache line: to execute either instruction with sequential consistency, the executing CPU has to pull the line into its own core in Exclusive or Modified state (from MESI). How, then, does the processor optimize fetch_add internally?
This is a simplified description of the benchmarking setup: there was no load+CAS loop for the compare_exchange_strong benchmark, just a single compare_exchange_strong on the atomic per iteration, with an input value that kept being varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
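The question does not include the actual benchmark code, so the following is only a hypothetical sketch of the setup as described; the thread count, iteration count, and the way expected is varied are my own assumptions, not the author's code.

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<uint64_t> counter{0};

// fetch_add variant: each iteration executes one lock add.
void bench_fetch_add(uint64_t iters) {
    for (uint64_t i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_seq_cst);
}

// CAS variant as described: one compare_exchange_strong per iteration,
// no load+CAS retry loop; expected is just varied by thread and iteration.
void bench_cas(uint64_t iters, uint64_t tid) {
    for (uint64_t i = 0; i < iters; ++i) {
        uint64_t expected = tid + i;   // usually stale, so the CAS mostly fails
        counter.compare_exchange_strong(expected, expected + 1,
                                        std::memory_order_seq_cst);
    }
}

int main() {
    std::vector<std::thread> threads;
    for (uint64_t t = 0; t < 32; ++t)
        threads.emplace_back(bench_fetch_add, 10'000'000);  // or bench_cas(..., t)
    for (auto& th : threads) th.join();
}
```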
The solution was to use a LOCK# signal to prevent anything else from accessing the same piece of memory at the same time.
x86 guarantees that aligned loads and stores up to 64 bits are atomic, but not wider accesses.
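As a quick check of that guarantee from C++: on x86-64, std::atomic<uint64_t> is always lock-free, so an aligned 8-byte atomic load or store is a single instruction rather than something emulated with a mutex. A minimal way to verify the lock-free property:

```cpp
#include <atomic>
#include <cstdint>
#include <iostream>

int main() {
    // Prints 1 on x86-64: aligned 8-byte atomics need no mutex emulation.
    std::cout << std::atomic<uint64_t>::is_always_lock_free << '\n';
}
```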
Different Modes of Operation: among its basic architectural features, the x86 processor includes various modes of operation. These processors have three primary modes of operation: protected mode, real-address mode, and system management mode.
Bus locking refers to locking the memory bus so that no other processor can access memory (perhaps only the location or cache line in question) while the bus is locked. On x86 it is effected by the LOCK prefix (which is implied for an XCHG with a memory operand).
lock add and lock cmpxchg both work essentially the same way, by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?) According to Agner Fog's instruction tables, lock cmpxchg and lock add are very similar numbers of uops from microcode (although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the var stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.
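For reference, this is roughly the code generation behind those two instructions (assuming GCC or Clang at -O2 for x86-64; register choices will differ, and a fetch_add whose result is used becomes lock xadd rather than lock add):

```cpp
#include <atomic>
#include <cstdint>

std::atomic<uint64_t> counter{0};

void do_fetch_add() {
    // Result unused: compiles to roughly  lock add QWORD PTR counter[rip], 1
    counter.fetch_add(1);
}

bool do_cas(uint64_t expected, uint64_t desired) {
    // Compiles to roughly  mov rax, expected ; lock cmpxchg counter[rip], desired ; sete al
    return counter.compare_exchange_strong(expected, desired);
}
```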
You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops.)
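For contrast, a typical CAS retry loop lets compare_exchange_strong refresh expected with the currently observed value on failure, so a failed attempt still sets up the next one; a minimal sketch:

```cpp
#include <atomic>
#include <cstdint>

// Conventional CAS increment: on failure, compare_exchange_strong writes the
// freshly observed value back into `expected`, so the retry isn't using a
// stale guess.
void cas_increment(std::atomic<uint64_t>& counter) {
    uint64_t expected = counter.load(std::memory_order_relaxed);
    while (!counter.compare_exchange_strong(expected, expected + 1,
                                            std::memory_order_seq_cst)) {
        // `expected` now holds the current value; just try again.
    }
}
```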
Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.
You don't have complete code in the question, so I have to guess, and I couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads that could record the number of lock'ed instructions executed.
With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.
It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple of lock add instructions.
Is that what you mean?
Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
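If methodology is the issue, two common fixes are a warm-up spin before the timed region (so the CPU leaves idle frequency) and a shared start flag so all threads begin contending at the same moment. A rough sketch under those assumptions (the iteration counts and names here are mine, not from the question):

```cpp
#include <atomic>
#include <cstdint>
#include <thread>
#include <vector>

std::atomic<bool> go{false};
std::atomic<uint64_t> counter{0};

void worker(uint64_t iters) {
    // Warm-up: busy-spin so the core ramps up from idle frequency before timing.
    volatile uint64_t sink = 0;
    for (uint64_t w = 0; w < 50'000'000; ++w)
        sink = sink + w;

    while (!go.load(std::memory_order_acquire)) {}   // common start point

    for (uint64_t i = 0; i < iters; ++i)
        counter.fetch_add(1, std::memory_order_seq_cst);
}

int main() {
    std::vector<std::thread> threads;
    for (int t = 0; t < 32; ++t)
        threads.emplace_back(worker, 10'000'000);
    go.store(true, std::memory_order_release);        // release all threads together
    for (auto& th : threads) th.join();
}
```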