atomic operation cost

What is the cost of an atomic operation (compare-and-swap, or atomic add/decrement)? How many cycles does it consume? Will it pause other processors on SMP or NUMA systems, or block memory accesses? Will it flush the reorder buffer in an out-of-order CPU?

What effect will it have on the cache?

I'm interested in modern, popular CPUs: x86, x86_64, PowerPC, SPARC, Itanium.

989

asked Mar 29 '10 13:03

osgx

1 Answer

I looked for actual data over the past few days and found nothing. However, I did some research that compares the cost of atomic operations with the cost of cache misses.

The cost of the x86 LOCK prefix (including lock cmpxchg for atomic CAS) before the Pentium Pro (as described in the doc) is a memory access (like a cache miss), plus stopping memory operations by other processors, plus any contention with other processors trying to LOCK the bus. Since the Pentium Pro, however, for normal write-back cacheable memory (all memory an application deals with, unless it talks directly to hardware), only the relevant cache line is blocked rather than all memory operations (based on the link in @osgx's answer).

That is, the core delays answering MESI share and RFO (read-for-ownership) requests for the line until after the store part of the locked operation. This is called a "cache lock", and it affects only that one cache line. Other cores can load, store, or even CAS other lines at the same time.
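Since a cache lock covers a single line, two atomics only stay independent of each other if they do not share a line. A minimal C++ sketch of such a layout follows. It assumes 64-byte cache lines (typical for x86, but an assumption here), and `PaddedCounter`, `Counters`, `cache_line_of`, `check_layout` and `bump` are names made up for illustration:

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, typical of current x86 parts.
constexpr std::size_t kCacheLineSize = 64;

// Hypothetical counter, aligned and padded so it owns a whole cache line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<long> value{0};
};

// Two counters meant to be updated by two different threads.
struct Counters {
    PaddedCounter a;
    PaddedCounter b;
};

// Index of the cache line that an address falls into.
std::uintptr_t cache_line_of(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) / kCacheLineSize;
}

// Sanity check: the two counters never land on the same line.
void check_layout(const Counters& c) {
    assert(cache_line_of(&c.a) != cache_line_of(&c.b));
}

// One locked increment; only the line holding `c` is involved.
// Returns the value after the increment.
long bump(PaddedCounter& c) {
    return c.value.fetch_add(1, std::memory_order_relaxed) + 1;
}
```

With this layout, one thread calling `bump(counters.a)` and another calling `bump(counters.b)` work on different lines, so neither should have to wait for the other's cache lock.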

Actually, the CAS case can be more complicated, as explained on this page, which gives no timings but an insightful description by a trustworthy engineer. (At least for the normal use case, where you do a pure load before the actual CAS.)
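For illustration, the "pure load, then CAS" pattern usually looks something like the following C++ sketch. `atomic_fetch_max` is a hypothetical helper written for this answer, not something taken from the linked page, and the memory orders are just one reasonable choice:

```cpp
#include <atomic>

// Atomically replaces the stored value with max(stored, candidate).
// Returns the value that was observed before any update.
long atomic_fetch_max(std::atomic<long>& target, long candidate) {
    // Pure load: only needs read access to the cache line.
    long observed = target.load(std::memory_order_relaxed);

    // CAS: needs exclusive ownership of the line. On failure,
    // compare_exchange_weak writes the current value into `observed`,
    // so the loop retries with fresh data.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate,
                                         std::memory_order_acq_rel,
                                         std::memory_order_relaxed)) {
        // Retry until the CAS succeeds or the update becomes unnecessary.
    }
    return observed;
}
```

The load and the CAS touch the same location but ask the cache for different access rights, which is where the extra cost discussed here can come from.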

Before going into too much detail: a LOCKed operation costs one cache miss plus possible contention with other processors on the same cache line, while a CAS plus the preceding load (which is almost always required, except for mutexes, where you always CAS between 0 and 1) can cost two cache misses.
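As a rough C++ sketch of the two cheaper cases: a single locked read-modify-write, and a mutex-style CAS whose expected value is a constant, so no preceding load is needed. The names `next_ticket`, `spin_try_lock` and `spin_unlock` are hypothetical. On x86, compilers typically implement these with LOCK-prefixed instructions such as lock xadd and lock cmpxchg, but the exact code generated depends on the compiler:

```cpp
#include <atomic>
#include <cassert>

// Single locked read-modify-write: one request for the cache line.
// Returns the value before the increment.
long next_ticket(std::atomic<long>& counter) {
    return counter.fetch_add(1, std::memory_order_relaxed);
}

// Mutex-style CAS: the expected value is always 0, so there is no
// separate load before the CAS. Returns true if the lock was taken.
bool spin_try_lock(std::atomic<int>& flag) {
    int expected = 0;
    return flag.compare_exchange_strong(expected, 1,
                                        std::memory_order_acquire,
                                        std::memory_order_relaxed);
}

// Releases a lock taken with spin_try_lock.
void spin_unlock(std::atomic<int>& flag) {
    assert(flag.load(std::memory_order_relaxed) == 1);  // must be held
    flag.store(0, std::memory_order_release);
}
```

A real spinlock would wrap `spin_try_lock` in a retry loop with some back-off; that part is left out to keep the sketch focused on the memory accesses.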

He explains that a load + CAS on a single location can actually cost two cache misses, like Load-Linked/Store-Conditional (see there for the latter). His explanation relies on knowledge of the MESI cache coherence protocol, which uses four states for a cache line: M(odified), E(xclusive), S(hared), and I(nvalid) (hence the name MESI), explained below where needed. The scenario is the following:

- The LOAD causes a cache miss: the relevant cache line is loaded from memory in the Shared state (i.e. other processors are still allowed to keep that cache line in their caches; no changes are allowed in this state). Possible cost: 1 cache miss. (This miss is skipped if the cache line is already in the Shared, Exclusive or Modified state, i.e. the data is in this CPU's L1 cache.)
- The program calculates the new value to store,
- and it runs an atomic CAS instruction.
  - It has to avoid concurrent modification, so it must remove copies of the cache line from the caches of other CPUs in order to move the cache line to the Exclusive state. Possible cost: 1 cache miss. This is not needed if the line is already exclusively owned, i.e. in the Exclusive or Modified state. In both states, no other CPU holds the cache line, but in the Exclusive state it has not been modified (yet).
  - After this communication, the variable is modified in our CPU's local cache, at which point it is globally visible to all other CPUs (because their caches are coherent with ours). It will eventually be written back to main memory according to the usual algorithms.
  - Other processors trying to read or modify that variable will first have to obtain that cache line in Shared or Exclusive mode, and to do so will contact this processor and receive the updated version of the cache line.

A LOCKed operation, instead, costs at most one cache miss (because the cache line is requested directly in the Exclusive state).

In all cases, a cache line request can be stalled by other processors that are already modifying the data.
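The miss counting in the scenario above can be written down as a toy C++ model. This is only a bookkeeping sketch, not a simulation of real hardware: it assumes that a load of an Invalid line always arrives in the Shared state (as in the scenario), it counts an upgrade from Shared to exclusive ownership as one miss, and it ignores contention entirely. All names are made up for illustration:

```cpp
// The four MESI states of one cache line, as seen by one core.
enum class Mesi { Modified, Exclusive, Shared, Invalid };

struct Line {
    Mesi state = Mesi::Invalid;
    int misses = 0;  // coherence requests this core had to issue
};

// Plain load: any valid state is good enough.
// Assumption: a missing line arrives in the Shared state.
void plain_load(Line& line) {
    if (line.state == Mesi::Invalid) {
        ++line.misses;
        line.state = Mesi::Shared;
    }
}

// Locked read-modify-write (including CAS): needs exclusive ownership
// and leaves the line Modified.
void locked_rmw(Line& line) {
    if (line.state == Mesi::Invalid || line.state == Mesi::Shared) {
        ++line.misses;  // other copies must be invalidated first
    }
    line.state = Mesi::Modified;
}

// Cost of "load, compute, CAS" starting from a given state.
int misses_load_then_cas(Mesi initial) {
    Line line;
    line.state = initial;
    plain_load(line);
    locked_rmw(line);
    return line.misses;
}

// Cost of a single locked operation starting from a given state.
int misses_locked_only(Mesi initial) {
    Line line;
    line.state = initial;
    locked_rmw(line);
    return line.misses;
}
```

Under these assumptions, `misses_load_then_cas(Mesi::Invalid)` gives 2 while `misses_locked_only(Mesi::Invalid)` gives 1, which is the difference described above; starting from Exclusive or Modified, both give 0.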

158

answered Oct 16 '22 17:10

Blaisorblade

Tags: performance, cpu-architecture, atomic, lock-free
