
When can the CPU ignore the LOCK prefix and use cache coherency?

I originally thought cache coherency protocols such as MESI can provide pseudo-atomicity, but only across individual memory load/store operations. If I were performing a fetch, modify, write combination of instructions, MESI alone wouldn't be able to enforce atomicity from the first instruction through the last.
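
To make my concern concrete, here is a minimal C sketch of my own (thread and iteration counts are arbitrary): two threads incrementing a plain counter lose updates, while the lock-prefixed atomic increment does not.

    #include <pthread.h>
    #include <stdio.h>

    static long plain_counter = 0;   /* incremented with a racy load/add/store */
    static long atomic_counter = 0;  /* incremented with a lock-prefixed RMW */

    static void *worker(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++) {
            plain_counter++;  /* load, add, store: MESI keeps each step
                                 coherent, but another thread can interleave
                                 between them and an increment gets lost */
            __atomic_fetch_add(&atomic_counter, 1, __ATOMIC_RELAXED);
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t a, b;
        pthread_create(&a, NULL, worker, NULL);
        pthread_create(&b, NULL, worker, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* plain_counter typically prints less than 2000000;
           atomic_counter is always exactly 2000000 */
        printf("plain: %ld, atomic: %ld\n", plain_counter, atomic_counter);
        return 0;
    }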

However, section 8 of the Intel reference manual Vol 3a says:

8.1.4 Effects of a LOCK Operation on Internal Processor Caches

For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow its cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.

http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf

This seems to contradict my understanding: it implies that the bus LOCK# signal isn't always needed, because the cache coherency mechanism can enforce the atomicity instead. Is that right?

asked Aug 24 '14 by user997112



2 Answers

There's a difference between locking as a concept and the actual bus LOCK# signal: the latter is one means of implementing the former. Cache locking is another, and it is much simpler and more efficient.

The MESI protocol guarantees that if a line is held exclusively by a certain core (whether modified or not), no one else has it. In that case the core can perform multiple operations atomically by setting a simple flag in the cache that blocks external snoops until the operations are done. This has the same effect the lock concept dictates, since no one else can change or even observe the intermediate values.
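
You can't observe that snoop-blocking flag from software, but you can see the case it serves. A sketch using standard C11 atomics (the 64-byte line size is an assumption about the target): an aligned atomic counter whose read-modify-write compiles to a single lock-prefixed instruction, which on P6 and later can be satisfied entirely by cache locking, per the quote above.

    #include <stdatomic.h>
    #include <stdalign.h>

    /* 64-byte alignment (an assumed line size) guarantees the counter is
       "completely contained in a cache line", the condition in the quoted
       manual text, so the locked RMW below can be served as a cache lock
       rather than by asserting LOCK# on the bus. */
    static alignas(64) atomic_long counter;

    long bump(void)
    {
        /* On x86 this compiles to a lock-prefixed xadd: the core takes the
           line exclusive, holds off snoops for the duration of the
           read-modify-write, then lets coherency traffic resume. */
        return atomic_fetch_add(&counter, 1);
    }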

In more complicated cases, the line is not held by a single cache (e.g., it may be shared between several caches, or the access may be split across two cache lines of which only one is in your cache; the list of scenarios is usually implementation-specific and probably not disclosed by the CPU manufacturer). In such cases you may have to resort to "heavier cannons" like the bus lock, which usually guarantees that no one can do anything on the shared bus. Obviously this has a huge impact on performance, so it is probably used only when there is no other choice. In most cases a simple cache-level lock should be enough. Note that newer schemes like Intel TSX seem to work in a similar manner, offering optimizations when you are working from within the cache.
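
For illustration, here is a hypothetical way to force the split-line case (the misaligned cast is deliberate and not something to ship; whether the fallback is a full bus lock, and whether the CPU faults on it, is implementation-specific):

    #include <stdint.h>

    /* 128 bytes aligned to an assumed 64-byte line; an 8-byte value placed
       at offset 60 straddles the boundary between the two lines. */
    static _Alignas(64) unsigned char buf[128];

    void bump_split(void)
    {
        int64_t *p = (int64_t *)(buf + 60);  /* deliberately misaligned */
        /* The lock prefix still makes this atomic, but the core cannot
           cache-lock two lines at once, so it falls back to the far more
           expensive bus lock (recent CPUs can even raise a fault here when
           split-lock detection is enabled). */
        __atomic_fetch_add(p, 1, __ATOMIC_RELAXED);
    }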

By the way, your assumption about pseudo-atomicity for individual instructions is also wrong: it would be correct if you referred to a single memory operation (a load or a store), since an instruction may include several (e.g., inc [addr] would not be atomic without a lock). Another restriction, which also appears in your quote, is that the access must be contained within a single cache line: split accesses don't guarantee atomicity even within a single load or store (since they're usually implemented as two memory operations that are later merged).
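
To spell out that decomposition, a small sketch of my own in GCC inline assembly:

    /* inc [addr] is a single instruction but three micro-operations:
       load, add, store. MESI keeps the load and the store individually
       coherent; only the LOCK prefix makes the whole read-modify-write
       indivisible. */
    static inline void inc_plain(long *p)
    {
        __asm__ volatile ("incq %0" : "+m"(*p));       /* not atomic */
    }

    static inline void inc_locked(long *p)
    {
        __asm__ volatile ("lock incq %0" : "+m"(*p));  /* atomic RMW */
    }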

answered Sep 20 '22 by Leeor


Reading the excerpt you give, I don't find it contradictory to the use of LOCK-ed instructions. Consider the INC instruction, for example. Without LOCK, a core can read the original value while holding the cache line in the SHARED state; that does not prevent other cores caching the same line from concurrently reading the same value before storing the same incremented result = a data race.

I interpret the quote as saying that data integrity is guaranteed at cache-line granularity, so extra care may not be necessary when the data fits within one cache line. But if the data crosses the boundary between two cache lines, it is necessary to ensure that modifications to both of them are carried out atomically.
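
In practice you can guarantee the single-line case at compile time; a sketch assuming a 64-byte line size:

    #include <assert.h>
    #include <stdalign.h>
    #include <stdint.h>

    #define LINE 64  /* assumed cache-line size for the target */

    /* Starting the 8-byte counter on a line boundary means it can never
       cross one, so cache locking is always sufficient for locked RMWs
       on it. */
    static alignas(LINE) int64_t counter;

    static_assert(sizeof(counter) <= LINE,
                  "counter must fit within one cache line");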

answered Sep 19 '22 by Anton