
Why is LOCK a full barrier on x86?

Why does the LOCK prefix cause a full barrier on x86 (and thus drain the store buffer and provide sequential consistency)?

For LOCK/read-modify-write operations, a full barrier shouldn't be required; exclusive access to the cache line seems to be sufficient. Is it a design choice, or is there some other limitation?

yggdrasil asked Feb 21 '20 05:02



1 Answer

A long time ago, before the Intel 80486, Intel processors didn't have on-chip caches or write buffers. Therefore, by design, all writes became globally visible immediately and in order, and there were no stores to drain from anywhere. A locked transaction was executed by fully locking the bus for the entire address space.

In the 486 and Pentium processors, on-chip write buffers were added, and some models have on-chip caches as well. Consider first the models that don't have on-chip caches. All writes are temporarily held in the on-chip write buffers until they can be written to the bus, either when the bus becomes available or when a serializing event occurs. Remember that atomic RMW transactions are used to acquire exclusive access to software structures or hardware resources. So if a processor performs a locked transaction, it must not happen that the processor thinks it has been granted ownership of the resource while another processor somehow ends up obtaining ownership as well. If the write part of the locked transaction were buffered in a write buffer and the bus lock were then relinquished, nothing would prevent other agents from acquiring access to the resource at the same time. Essentially, the write part has to be made visible to all other agents, and the way to do this is by not buffering it. But the x86 memory model requires that all writes become globally visible in order (there was no weak ordering on these processors). So in order to make the write part of a locked transaction globally observable, all earlier buffered writes also had to be made globally observable, in the same order.
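To make the "exclusive access to software structures" point concrete, here is a minimal sketch of a test-and-set spinlock built on an atomic RMW. The names `SpinLock` and `count_with_lock` are hypothetical and only for illustration. On x86, compilers typically implement `exchange` with `XCHG`, which is implicitly locked when it has a memory operand; that is an assumption about code generation, not something the C++ standard promises.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Hypothetical illustration: a test-and-set spinlock built on an atomic RMW.
struct SpinLock {
    std::atomic<bool> held{false};

    void lock() {
        // exchange() atomically writes true and returns the previous value.
        // Ownership is acquired only when the previous value was false.
        while (held.exchange(true, std::memory_order_acquire)) {
            // Spin until the current owner releases the lock.
        }
    }

    void unlock() {
        held.store(false, std::memory_order_release);
    }
};

// Increments a plain (non-atomic) counter from several threads under the lock
// and returns the final value.
long count_with_lock(int threads, int iters) {
    SpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;  // protected by the lock
                lock.unlock();
            }
        });
    }
    for (auto& th : pool) {
        th.join();
    }
    return counter;
}
```

If the write half of the RMW could sit in a private buffer after the lock on the bus (or cache line) was released, two threads could both see `false` come back from `exchange` and both enter the critical section, which is exactly the double-ownership scenario described above.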

Some 486 models and all Pentium processors have on-chip caches, but these processors had no support for cache locking. That's why locked transactions were not cacheable on them: the only way to guarantee atomicity was to bypass the cache and lock the bus. After the bus lock is acquired, one or more writes are performed, depending on the alignment and size of the destination memory region. The write buffers still have to be drained before the bus lock is released.

The Pentium Pro introduced some major changes, including weakly-ordered writes, write-combining (WC) buffers, and cache locking. What used to be called "write buffers" is what is usually referred to as the store buffer on more modern microarchitectures. A locked transaction uses cache locking on these processors, but the cache lock cannot be released until the locked store has been committed from the store buffer to the cache. That commit makes the store globally observable, which necessarily requires making all earlier stores globally observable first. These events have to happen in that order. That said, I don't think locked transactions have to serialize weakly-ordered writes, but Intel decided to make them behave this way, maybe because Intel wanted a convenient instruction that drains the WC buffers on the PPro in the absence of a dedicated store fence.
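The practical effect of "locked store commits, so all earlier stores commit first, and later loads wait" can be sketched with the classic store-buffering litmus test. The function name `count_both_zero` is hypothetical. The sketch uses sequentially consistent C++ atomics; on x86, compilers typically emit an implicitly locked `XCHG` for the `exchange` calls and plain `MOV` for the loads, which is an assumption about code generation rather than a guarantee.

```cpp
#include <atomic>
#include <thread>

// Hypothetical illustration: runs the store-buffering litmus test `rounds`
// times and returns how many rounds ended with both threads reading 0.
int count_both_zero(int rounds) {
    int both_zero = 0;
    for (int i = 0; i < rounds; ++i) {
        std::atomic<int> x{0};
        std::atomic<int> y{0};
        int r1 = -1;
        int r2 = -1;

        std::thread a([&] {
            // Atomic RMW store to x, then a load of y.
            x.exchange(1, std::memory_order_seq_cst);
            r1 = y.load(std::memory_order_seq_cst);
        });
        std::thread b([&] {
            // Mirror image: atomic RMW store to y, then a load of x.
            y.exchange(1, std::memory_order_seq_cst);
            r2 = x.load(std::memory_order_seq_cst);
        });
        a.join();
        b.join();

        // With sequentially consistent operations, at least one thread must
        // observe the other's store, so r1 == 0 && r2 == 0 is forbidden.
        if (r1 == 0 && r2 == 0) {
            ++both_zero;
        }
    }
    return both_zero;
}
```

With plain stores instead of the RMWs, each store could still be sitting in its core's store buffer while the following load executes, and both threads could read 0. Because the locked operation cannot complete until its store (and everything before it) has left the store buffer, that outcome is ruled out here.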

Hadi Brais answered Nov 19 '22 16:11
