 

How "lock add" is implemented on x86 processors

I recently benchmarked std::atomic::fetch_add vs std::atomic::compare_exchange_strong on a 32-core Intel Skylake processor. Unsurprisingly (given the myths I've heard about fetch_add), fetch_add is almost an order of magnitude more scalable than compare_exchange_strong. Looking at the disassembly of the program, std::atomic::fetch_add is implemented with lock add and std::atomic::compare_exchange_strong is implemented with lock cmpxchg (https://godbolt.org/z/qfo4an).
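
(For illustration, roughly the code generation that disassembly would show — a sketch of typical GCC/Clang x86-64 output with the instructions as comments; register choice varies by compiler and isn't the exact godbolt listing.)

    #include <atomic>

    void do_add(std::atomic<int>& n) {
        n.fetch_add(1);                       // lock add DWORD PTR [rdi], 1
    }

    bool do_cas(std::atomic<int>& n, int expected, int desired) {
        // mov   eax, esi
        // lock  cmpxchg DWORD PTR [rdi], edx
        // sete  al
        return n.compare_exchange_strong(expected, desired);
    }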

What makes lock add so much faster on an Intel multi-core processor? From my understanding, the slowness of both instructions comes from contention on the cache line, and to execute either instruction with sequential consistency, the executing CPU has to pull the line into its own core in Exclusive or Modified state (from MESI). How then does the processor optimize fetch_add internally?


This is a simplified version of the benchmarking code. There was no load+CAS loop for the compare_exchange_strong benchmark, just a single compare_exchange_strong on the atomic per iteration, with an expected value that varied by thread and iteration. So it was just a comparison of instruction throughput under contention from multiple CPUs.
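
A minimal sketch of a benchmark matching that description — the thread count, iteration count, and timing harness here are assumptions, not the original code:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    constexpr int kThreads = 32;
    constexpr int kIters   = 1'000'000;

    std::atomic<int> counter{0};

    // fetch_add variant: one locked RMW per iteration.
    void bench_fetch_add(int /*tid*/) {
        for (int i = 0; i < kIters; ++i)
            counter.fetch_add(1, std::memory_order_seq_cst);
    }

    // CAS variant: one compare_exchange_strong per iteration, no load+CAS
    // retry loop; the expected value just varies by thread and iteration,
    // so most attempts fail under contention.
    void bench_cas(int tid) {
        for (int i = 0; i < kIters; ++i) {
            int expected = tid + i;
            counter.compare_exchange_strong(expected, expected + 1,
                                            std::memory_order_seq_cst);
        }
    }

    // Run fn on kThreads threads and return elapsed wall-clock seconds.
    template <class Fn>
    double run(Fn fn) {
        std::vector<std::thread> pool;
        auto t0 = std::chrono::steady_clock::now();
        for (int t = 0; t < kThreads; ++t) pool.emplace_back(fn, t);
        for (auto& th : pool) th.join();
        return std::chrono::duration<double>(
                   std::chrono::steady_clock::now() - t0).count();
    }

    int main() {
        std::printf("fetch_add: %.3f s\n", run(bench_fetch_add));
        std::printf("cas:       %.3f s\n", run(bench_cas));
    }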

asked Dec 30 '19 by Curious


People also ask

What is the purpose of the lock signal in ARM?

The solution was to use a #LOCK signal to prevent anything else from accessing the same piece of memory at the same time.

Are x86 instructions atomic?

x86 guarantees that aligned loads and stores up to 64 bits are atomic, but not wider accesses.
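
For illustration, a small example — with typical x86-64 code generation noted in comments (exact output varies by compiler); the naturally aligned 8-byte load and store need no LOCK prefix at all:

    #include <atomic>
    #include <cstdint>

    std::atomic<std::uint64_t> x;   // naturally aligned 8-byte object

    std::uint64_t load_it() {
        return x.load(std::memory_order_relaxed);    // mov rax, QWORD PTR x[rip]
    }

    void store_it(std::uint64_t v) {
        x.store(v, std::memory_order_relaxed);       // mov QWORD PTR x[rip], rdi
    }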

What are the x86 processor's three basic modes of operation?

The x86 architecture includes several modes of operation. Three are primarily used: protected mode, real-address mode, and system management mode.

What is bus lock?

Bus locking refers to locking the memory bus, so no other processor can access memory (perhaps only the location or cache line in question) while the bus is locked. On x86 it's effected by the LOCK prefix (which is implied for xchg with a memory operand).
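
For example, the implied lock for xchg is visible in typical x86-64 code generation for std::atomic (a sketch; exact registers vary by compiler):

    #include <atomic>

    std::atomic<int> a;

    int swap_it(int v) {
        // Typically compiles to:
        //   mov  eax, edi
        //   xchg eax, DWORD PTR a[rip]
        // No explicit LOCK prefix: xchg with a memory operand is implicitly locked.
        return a.exchange(v);
    }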


1 Answer

lock add and lock cmpxchg both work essentially the same way: by holding onto that cache line in Modified state for the duration of the microcoded instruction. (Can num++ be atomic for 'int num'?) According to Agner Fog's instruction tables, lock cmpxchg and lock add decode to very similar numbers of uops from microcode (although lock add is slightly simpler). Agner's throughput numbers are for the uncontended case, where the variable stays hot in L1d cache of one core. And cache misses can cause uop replays, but I don't see any reason to expect a significant difference.

You claim you aren't doing load+CAS or using a retry loop. But is it possible you're only counting successful CAS or something? On x86, every CAS (including failures) has almost identical cost to lock add. (With all your threads hammering on the same atomic variable, you'll get lots of CAS failures from using a stale value for expected. This is not the usual use-case for CAS retry loops).
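
For contrast, the usual load + CAS retry idiom looks like this (a generic sketch, not the benchmark code):

    #include <atomic>

    // Emulating fetch_add with a CAS retry loop: compare_exchange_strong
    // writes the currently-observed value back into 'expected' on failure,
    // so every retry uses a fresh value rather than a stale one.
    int fetch_add_via_cas(std::atomic<int>& n) {
        int expected = n.load(std::memory_order_relaxed);
        while (!n.compare_exchange_strong(expected, expected + 1)) {
            // retry with the freshly-observed value now in 'expected'
        }
        return expected;   // value before the increment, like fetch_add
    }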

Or does your CAS version actually do a pure load from the atomic variable to get an expected value? That might be leading to memory-order mis-speculation.

You don't have complete code in the question, so I have to guess, and I couldn't try it on my desktop. You don't even have any perf-counter results or anything like that; there are lots of perf events for off-core memory access, and events like mem_inst_retired.lock_loads could record the number of locked instructions executed.

With lock add, every time a core gets ownership of the cache line, it succeeds at doing an increment. Cores are only waiting for HW arbitration of access to the line, never for another core to get the line and then fail to increment because it had a stale value.


It's plausible that HW arbitration could treat lock add and lock cmpxchg differently, e.g. perhaps letting a core hang onto the line for long enough to do a couple lock add instructions.

Is that what you mean?


Or maybe you have some major failure in microbenchmark methodology, like maybe not doing a warm-up loop to get CPU frequency up from idle before starting your timing? Or maybe some threads happen to finish early and let the other threads run with less contention?
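
One common mitigation for both issues is to have every thread spin on a start flag, so the cores are already out of idle and all threads enter the contended section together — a sketch, not the questioner's code:

    #include <atomic>
    #include <chrono>
    #include <cstdio>
    #include <thread>
    #include <vector>

    std::atomic<int>  counter{0};
    std::atomic<bool> go{false};

    void worker() {
        // Spinning on the start flag keeps this core busy, so clock frequency
        // has already ramped up before the measured section starts.
        while (!go.load(std::memory_order_acquire)) { }
        for (int i = 0; i < 1'000'000; ++i)
            counter.fetch_add(1);
    }

    int main() {
        std::vector<std::thread> pool;
        for (int t = 0; t < 32; ++t) pool.emplace_back(worker);

        // Give every thread time to reach the spin loop, then release them all
        // at once so they contend with each other for the whole measurement.
        std::this_thread::sleep_for(std::chrono::milliseconds(100));
        auto t0 = std::chrono::steady_clock::now();
        go.store(true, std::memory_order_release);
        for (auto& th : pool) th.join();

        double secs = std::chrono::duration<double>(
                          std::chrono::steady_clock::now() - t0).count();
        std::printf("elapsed: %.3f s\n", secs);
    }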

answered Oct 01 '22 by Peter Cordes