I originally thought that cache coherency protocols such as MESI could provide pseudo-atomicity, but only for individual memory load/store instructions. If I performed a fetch, modify, write sequence of instructions, MESI alone would not be able to enforce atomicity from the first instruction to the last.
However, section 8 of the Intel reference manual Vol 3a says:
8.1.4 Effects of a LOCK Operation on Internal Processor Caches
For the P6 and more recent processor families, if the area of memory being locked during a LOCK operation is cached in the processor that is performing the LOCK operation as write-back memory and is completely contained in a cache line, the processor may not assert the LOCK# signal on the bus. Instead, it will modify the memory location internally and allow it’s cache coherency mechanism to ensure that the operation is carried out atomically. This operation is called “cache locking.” The cache coherency mechanism automatically prevents two or more processors that have cached the same area of memory from simultaneously modifying data in that area.
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-vol-3a-part-1-manual.pdf
This seems to contradict my understanding: does it imply that the LOCK instruction doesn't need to be used, because cache coherency can be relied on instead?
The cache coherence problem: when multiple processors operate in parallel and independently, multiple caches may hold different copies of the same memory block, which creates the cache coherence problem. Cache coherence schemes avoid this problem by maintaining a uniform state for each cached block of data.
Coherence protocols apply cache coherence in multiprocessor systems. The intention is that two clients must never see different values for the same shared data. The protocol must implement the basic requirements for coherence. It can be tailor-made for the target system or application.
Cache coherency becomes an issue when multiple processor cores share the same memory hierarchy but have their own L1 data and instruction caches. Incorrect execution could occur if two or more copies of a given cache block exist in different processors' caches and one of these copies is modified.
Cache coherence refers to the problem of keeping the data in these caches consistent. The main problem is dealing with writes by a processor. One of the general strategies for dealing with writes to a cache is write-through: all data written to the cache is also written to memory at the same time.
There's a difference between locking as a concept and the actual LOCK# bus signal: the latter is one of the means of implementing the former. Cache locking is another means, and it is much simpler and more efficient.
The MESI protocol guarantees that if a line is held exclusively by a certain core (whether modified or not), no other core has it. In this case you can perform multiple operations atomically by adding a simple flag in the cache that blocks external snoops until the operations are done. This has the same effect that the lock concept dictates, since no one else can change or even observe the intermediate values.
In more complicated cases the line is not held by a single cache. For example, it may be shared between several caches, or the access may be split between two cache lines of which only one is in your cache (the list of scenarios is usually implementation specific and probably not disclosed by the CPU manufacturer). In such cases you may have to resort to "heavier" cannons like the bus lock, which usually guarantees that no one else can do anything on the shared bus. Obviously this has a huge impact on performance, so it is probably only used when there is no other choice. In most cases a simple cache-level lock should be enough. Note that newer schemes like Intel TSX seem to work in a similar manner, offering optimizations when you're working from within the cache.
By the way, your assumption about pseudo-atomicity for individual instructions is also wrong. It would be correct if you referred to a single memory operation (a load or a store), but an instruction may include several of them (inc [addr], for example, would not be atomic without a lock). Another restriction, which also appears in your quote, is that the access needs to be contained within a cache line: split-line accesses don't guarantee atomicity even for a single load or store (since they're usually implemented as two memory operations that are later merged).
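To make the inc [addr] point concrete, here is a minimal C++ sketch. It assumes an x86-64 target where mainstream compilers typically lower std::atomic's fetch_add to a LOCK-prefixed instruction; that mapping is an assumption about the toolchain, not something the C++ standard promises. The function names count_with_rmw and count_with_load_store are made up for this illustration.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

// One atomic read-modify-write per increment. On x86-64 this is usually
// compiled to a LOCK-prefixed add/xadd (assumption about the compiler).
long count_with_rmw(int threads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i)
                counter.fetch_add(1, std::memory_order_relaxed);
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}

// The same work split into a separate atomic load and atomic store.
// Each half is atomic on its own, but another thread can run between
// the two, so increments can be lost (similar to an unlocked inc).
long count_with_load_store(int threads, int iters) {
    std::atomic<long> counter{0};
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&counter, iters] {
            for (int i = 0; i < iters; ++i) {
                long v = counter.load(std::memory_order_relaxed);
                counter.store(v + 1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter.load();
}
```

Calling count_with_rmw(4, 100000) always yields 400000, while count_with_load_store(4, 100000) can come out lower on a multi-core machine; how much lower varies from run to run.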
Reading the excerpt you give, I don't find that it contradicts using a LOCK-ed instruction. For example, consider the INC instruction. Without LOCK, it can read the original value while holding its cache line in the SHARED state, which does not prevent other cores holding the same line from concurrently reading the same value and then storing the same incremented result: a data race.
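The interleaving described above can be stepped through by hand with a toy model. This is only a sketch, not how real hardware is built: the System and Core types are hypothetical, there are just two cores and one memory word, and stores are written straight to memory to keep the model small.

```cpp
#include <cassert>

// Toy model only: all names below are hypothetical.
enum class LineState { Invalid, Shared, Modified };

struct Core {
    LineState line = LineState::Invalid;
    int reg = 0;  // private register holding the value being incremented
};

struct System {
    int memory = 5;
    Core a, b;

    // Load: the reader gets a Shared copy; a Modified copy elsewhere is
    // downgraded to Shared. Several Shared copies may coexist.
    void load(Core& reader, Core& other) {
        if (other.line == LineState::Modified) other.line = LineState::Shared;
        reader.line = LineState::Shared;
        reader.reg = memory;
    }

    // Store: the writer takes the line Modified and the other copy is
    // invalidated. The other core's register is NOT touched, so a value
    // it read earlier stays there, stale.
    void store(Core& writer, Core& other) {
        other.line = LineState::Invalid;
        writer.line = LineState::Modified;
        memory = writer.reg;  // simplification: write straight to memory
    }
};

// Plain INC on both cores, interleaved as: load A, load B, store A, store B.
int unlocked_increments() {
    System s;
    s.load(s.a, s.b);
    s.load(s.b, s.a);    // both lines are Shared, both registers hold 5
    s.a.reg += 1;
    s.b.reg += 1;
    s.store(s.a, s.b);   // memory becomes 6
    s.store(s.b, s.a);   // memory becomes 6 again: one increment is lost
    return s.memory;
}

// LOCK INC modeled as an indivisible load-add-store on each core in turn.
int locked_increments() {
    System s;
    s.load(s.a, s.b); s.a.reg += 1; s.store(s.a, s.b);  // memory becomes 6
    s.load(s.b, s.a); s.b.reg += 1; s.store(s.b, s.a);  // memory becomes 7
    return s.memory;
}
```

Starting from 5, unlocked_increments() returns 6 and locked_increments() returns 7. Coherence keeps the two cached copies consistent at every step; what goes wrong in the first case is that each core acts on a value it read before the other core's store.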
I interpret the quote as saying that data integrity is guaranteed at cache-line granularity, so additional care may not be necessary when the data fits in one cache line. But if the data crosses the boundary between two cache lines, it is necessary to ensure that the modifications to both of them are treated atomically.
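Whether an access fits in one cache line can be checked with simple address arithmetic. The sketch below assumes 64-byte lines, which is common on current x86 processors but is an assumption here and not taken from the quoted manual section; kCacheLine and crosses_cache_line are illustrative names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines. Adjust for the actual target.
constexpr std::size_t kCacheLine = 64;

// True if the byte range [addr, addr + size) touches more than one line.
constexpr bool crosses_cache_line(std::uintptr_t addr, std::size_t size) {
    if (size == 0) return false;
    return (addr / kCacheLine) != ((addr + size - 1) / kCacheLine);
}

// An 8-byte access at offset 56 ends exactly at the line boundary.
static_assert(!crosses_cache_line(56, 8), "fits in one line");
// The same access at offset 60 spills 4 bytes into the next line.
static_assert(crosses_cache_line(60, 8), "split across two lines");
```

A naturally aligned object whose size is a power of two no larger than the line size can never straddle a line, which is one reason atomic variables are normally kept at their natural alignment.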