When reading about memory consistency models (x86's TSO in particular), authors generally describe a machine model with a number of CPUs, each with an associated store buffer and a private cache.
If my understanding is correct, a store buffer can be described as a queue into which a CPU puts the stores it wants to commit to memory. So, as the name states, it buffers stores.
But when I read those papers, they tend to talk about the interaction of loads and stores, with statements such as "a later load can pass an earlier store", which is slightly confusing: they almost seem to be saying that the store buffer holds both loads and stores, when it doesn't -- right?
So there must also be a load buffer that they are not (at least explicitly) talking about. Plus, the two must somehow be synchronized, so each knows when it's acceptable to load from memory and when to commit to memory -- or am I missing something?
Can anyone shed some more light on this?
EDIT:
Let's look at a paragraph out of "A primer on memory consistency and cache coherence":
To understand the implementation of atomic RMWs in TSO, we consider the RMW as a load immediately followed by a store. The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer, but this is not legal. If the load part of the RMW passes an earlier store, then the store part of the RMW would also have to pass the earlier store because the RMW is an atomic pair. But because stores are not allowed to pass each other in TSO, the load part of the RMW cannot pass an earlier store either
more specifically,
The load part of the RMW cannot pass earlier loads due to TSO’s ordering rules. It might at first appear that the load part of the RMW could pass earlier stores in the write buffer
so they are referring to loads / stores crossing each other in the write buffer (which I assume is the same thing as the store buffer?)
Thanks
Yes, write buffer = store buffer.
They're talking about what would happen if an atomic RMW were split up into a separate load and store, and the store buffer delayed another store (to a separate address) so it became visible after the RMW's load but still before the RMW's store.
Obviously that would make it non-atomic, and violate the requirement that all x86 atomic RMW operations are also full barriers. (The lock prefix implies that, too.)
Normally it would be hard for a reader to detect that, but if the "separate address" was contiguous with the atomic RMW, then e.g. a dword store + a dword RMW could be observed by another thread doing a 64-bit qword load of both as one atomic operation.
re: the title question:
Load buffers don't cause reordering. They wait for data that hasn't arrived yet; the load finishes "executing" when it reads data.
Store buffers are fundamentally different; they hold data for some time before it becomes globally visible.
x86's TSO memory model can be described as sequential consistency + a store buffer (with store-forwarding). See also x86 mfence and C++ memory barrier and the comments on that answer for more discussion of why merely allowing StoreLoad reordering is not a sufficient description: a thread can reload data that it just stored, and if a load partially overlaps with recent stores, the HW merges data from the store buffer with data from L1d to complete the load before the store is globally visible.
Also note that x86 CPUs speculatively do reorder loads (at least Intel's do), but shoot down the mis-speculation to preserve the TSO memory model of no LoadLoad or LoadStore reordering. CPUs thus have to track loads vs. store ordering. Intel calls the combined store+load buffer tracking structure the "memory order buffer" (MOB). See Size of store buffers on Intel hardware? What exactly is a store buffer? for more.