How do the store buffer and Line Fill Buffer interact with each other?

Tags: cpu-architecture, x86, cpu-cache, micro-architecture, cpu-mds

I was reading the MDS attack paper "RIDL: Rogue In-Flight Data Load". The authors discuss how the Line Fill Buffer can leak data. The question 'About the RIDL vulnerabilities and the "replaying" of loads' discusses the micro-architectural details of the exploit.

One thing that isn't clear to me after reading that question is why we need a Line Fill Buffer if we already have a store buffer.

John McCalpin discusses how the store buffer and Line Fill Buffer are connected in "How does WC-buffer relate to LFB?" on the Intel forums, but that doesn't really make things clearer to me. He writes:

For stores to WB space, the store data stays in the store buffer until after the retirement of the stores. Once retired, data can be written to the L1 Data Cache (if the line is present and has write permission), otherwise an LFB is allocated for the store miss. The LFB will eventually receive the "current" copy of the cache line so that it can be installed in the L1 Data Cache and the store data can be written to the cache. Details of merging, buffering, ordering, and "short cuts" are unclear.... One interpretation that is reasonably consistent with the above would be that the LFBs serve as the cacheline-sized buffers in which store data is merged before being sent to the L1 Data Cache. At least I think that makes sense, but I am probably forgetting something....

I've just recently started reading up on out-of-order execution so please excuse my ignorance. Here is my idea of how a store would pass through the store buffer and Line Fill Buffer.

1. A store instruction gets scheduled in the front-end.
2. It executes in the store unit.
3. The store request (an address and the data) is put in the store buffer.
4. An invalidate read request is sent from the store buffer to the cache system.
5. If it misses the L1d cache, then the request is put in the Line Fill Buffer.
6. The Line Fill Buffer forwards the invalidate read request to L2.
7. Some cache receives the invalidate read and sends its cache line.
8. The store buffer applies its value to the incoming cache line.
9. Uh? The Line Fill Buffer marks the entry as invalid.


Questions

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?
Is the ordering of events correct in my description?


asked Apr 09 '20 20:04

Daniel Näslund

1 Answer

Why do we need the Line Fill Buffer if the store buffer already exists to track outstanding store requests?

The store buffer is used to track stores, in order, both before they retire and after they retire but before they commit to the L1 cache². The store buffer conceptually is a totally local thing which doesn't really care about cache misses. The store buffer deals in "units" of individual stores of various sizes. Chips like Intel Skylake have store buffers of 50+ entries.

The line fill buffers primarily deal with both loads and stores that miss in the L1 cache. Essentially, they are the path from the L1 cache to the rest of the memory subsystem, and they deal in cache-line-sized units. We don't expect the LFB to get involved if the load or store hits in the L1 cache¹. Intel chips like Skylake have far fewer LFB entries, probably 10 to 12.
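To make the difference in granularity concrete, here is a minimal Python sketch. The class names, their fields and the 64-byte line size are illustrative assumptions, not a description of the real hardware structures.

```python
from dataclasses import dataclass, field

LINE_SIZE = 64  # bytes per cache line (assumed; typical for x86)


def line_address(addr: int) -> int:
    """Return the address of the cache line that contains addr."""
    return addr & ~(LINE_SIZE - 1)


@dataclass
class StoreBufferEntry:
    """One store: a few bytes at an arbitrary address (hypothetical layout)."""
    addr: int
    data: bytes            # the bytes written by that one store
    retired: bool = False


@dataclass
class LineFillBufferEntry:
    """One outstanding L1 miss: always a whole line (hypothetical layout)."""
    line_addr: int
    data: bytearray = field(default_factory=lambda: bytearray(LINE_SIZE))
    filled: bool = False


# Two small stores to the same line occupy two store buffer entries,
# but a miss on that line needs only one LFB entry.
stores = [
    StoreBufferEntry(0x1008, b"\x01" * 4),
    StoreBufferEntry(0x1030, b"\x02" * 8),
]
lines_needed = {line_address(s.addr) for s in stores}
```

Here `lines_needed` contains a single line address, which is the point: the store buffer tracks individual stores, while the LFB tracks lines.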

Is the ordering of events correct in my description?

Pretty close. Here's how I'd change your list:

1. A store instruction gets decoded and split into store-data and store-address uops, which are renamed, scheduled and have a store buffer entry allocated for them.
2. The store uops execute in any order or simultaneously (the two sub-items can execute in either order depending mostly on which has its dependencies satisfied first).
   1. The store data uop writes the store data into the store buffer.
   2. The store address uop does the virtual-to-physical translation and writes the address(es) into the store buffer.
3. At some point when all older instructions have retired, the store instruction retires. This means that the instruction is no longer speculative and the results can be made visible. At this point, the store remains in the store buffer and is called a senior store.
4. The store now waits until it is at the head of the store buffer (it is the oldest uncommitted store), at which point it will commit (become globally observable) into the L1, if the associated cache line is present in the L1 in MESIF Modified or Exclusive state (i.e. this core owns the line).
5. If the line is not present in the required state (either missing entirely, i.e. a cache miss, or present but in a non-exclusive state), permission to modify the line and (sometimes) the line data must be obtained from the memory subsystem: this allocates an LFB for the entire line, if one is not already allocated. This is a so-called request for ownership (RFO), which means that the memory hierarchy should return the line in an exclusive state suitable for modification, as opposed to a shared state suitable only for reading (this invalidates copies of the line present in any other private caches).

   An RFO to convert Shared to Exclusive still has to wait for a response to make sure all other caches have invalidated their copies. The response to such an invalidate doesn't need to include a copy of the data because this cache already has one. It can still be called an RFO; the important part is gaining ownership before modifying a line.
6. In the miss scenario the LFB eventually comes back with the full contents of the line, which is committed to the L1 and the pending store can now commit³.
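The commit part of the list above can be sketched as a toy Python model. Everything here (the `ToyCore` class, its fields, the log strings, the 64-byte line size) is a hypothetical illustration of the described flow, not how any real core is implemented.

```python
from collections import deque

LINE_SIZE = 64  # assumed line size in bytes


class ToyCore:
    """Toy model: in-order commit from a store buffer, LFB used only on a miss."""

    def __init__(self):
        self.store_buffer = deque()  # (addr, value) of retired stores, oldest first
        self.l1 = {}                 # line address -> "M", "E" or "S"
        self.lfb = set()             # line addresses with an RFO in flight
        self.memory = {}             # addr -> value, the committed data
        self.log = []

    def commit_head(self):
        """Try to commit the oldest store; return True if it committed."""
        addr, value = self.store_buffer[0]
        line = addr - addr % LINE_SIZE
        if self.l1.get(line) in ("M", "E"):
            # This core owns the line, so the store commits into L1.
            self.l1[line] = "M"
            self.memory[addr] = value
            self.store_buffer.popleft()
            self.log.append(f"commit {addr:#x}")
            return True
        if line not in self.lfb:
            # Miss or Shared state: allocate an LFB and send an RFO.
            self.lfb.add(line)
            self.log.append(f"RFO {line:#x}")
        return False

    def rfo_response(self, line):
        """The line arrives with ownership and is installed in L1."""
        self.lfb.discard(line)
        self.l1[line] = "E"
        self.log.append(f"fill {line:#x}")


core = ToyCore()
core.l1[0x1000] = "E"                 # this line is already owned
core.store_buffer.extend([(0x1008, 1), (0x2010, 2)])
core.commit_head()                    # hit: commits at once
core.commit_head()                    # miss: allocates an LFB, store waits
core.rfo_response(0x2000)
core.commit_head()                    # now the store can commit
```

After this sequence `core.log` records a commit, an RFO, a fill and a second commit, in that order. The LFB only appears for the store whose line was not owned.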

This is a rough approximation of the process. Some details may differ on some or all chips, including details which are not well understood.

As one example, in the above order, the store miss lines are not fetched until the store reaches the head of the store queue. In reality, the store subsystem may implement a type of RFO prefetch where the store queue is examined for upcoming stores and if the lines aren't present in L1, a request is started early (the actual visible commit to L1 still has to happen in order, on x86, or at least "as if" in order).

So the request and LFB use may occur as early as when step 3 completes (if RFO prefetch applies only after a store retires), or perhaps even as early as when 2.2 completes, if junior stores are subject to prefetch.
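A rough way to see why such an RFO prefetch would matter is a small timing model in Python. The function name, its parameters and the numbers are made up for illustration, and the model rests on strong simplifying assumptions that are listed in the docstring.

```python
def commit_times(num_stores, latency, num_lfb, prefetch):
    """Return the cycle at which each store commits, in program order.

    Toy assumptions: every store misses on a different line, each RFO takes
    `latency` cycles, committing itself is free, and an LFB is released as
    soon as its line arrives.
    """
    fill = []    # cycle at which each store's line arrives
    commit = []  # cycle at which each store commits
    for i in range(num_stores):
        prev_commit = commit[-1] if commit else 0
        if prefetch:
            # The RFO may start as soon as an LFB is free.
            issue = 0 if i < num_lfb else fill[i - num_lfb]
        else:
            # The RFO starts only when the store reaches the head.
            issue = prev_commit
        fill.append(issue + latency)
        # Commit stays in order even if a later line arrives early.
        commit.append(max(fill[i], prev_commit))
    return commit


without = commit_times(num_stores=4, latency=100, num_lfb=2, prefetch=False)
with_prefetch = commit_times(num_stores=4, latency=100, num_lfb=2, prefetch=True)
```

In this model the four misses are fully serialized without prefetch, while with prefetch they overlap in pairs because two LFBs are available. The commits still happen in program order in both cases.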

As another example, step 6 describes the line coming back from the memory hierarchy and being committed to the L1, then the store commits. It is possible that the pending store is actually merged instead with the returning data and then that is written to L1. It is also possible that the store can leave the store buffer even in the miss case and simply wait in the LFB, freeing up some store buffer entries.
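The merge possibility amounts to overlaying the store's bytes on the returning line. A minimal Python sketch follows; the function name and the 64-byte line size are assumptions for illustration, and whether real hardware merges this way is exactly the open question.

```python
LINE_SIZE = 64  # assumed line size in bytes


def merge_store_into_line(incoming_line, store_offset, store_data):
    """Overlay the bytes of one pending store on a returning cache line.

    Bytes covered by the store take the store's value; all other bytes
    keep the value that came back from the memory hierarchy.
    """
    if len(incoming_line) != LINE_SIZE:
        raise ValueError("expected a full cache line")
    if store_offset < 0 or store_offset + len(store_data) > LINE_SIZE:
        raise ValueError("store must fit inside one line")
    merged = bytearray(incoming_line)
    merged[store_offset:store_offset + len(store_data)] = store_data
    return bytes(merged)


line_from_l2 = bytes(range(LINE_SIZE))  # stand-in for the fetched line
merged = merge_store_into_line(line_from_l2, 8, b"\xaa\xbb\xcc\xdd")
```

Only bytes 8 to 11 of `merged` differ from `line_from_l2`; the rest of the line is whatever the memory hierarchy returned.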

¹ In the case of stores that hit in the L1 cache, there is a suggestion that the LFBs are actually involved: that each store actually enters a combining buffer (which may just be an LFB) prior to being committed to the cache, such that a series of stores targeting the same cache line get combined and only need to access the L1 once. This isn't proven, and in any case it is not really part of the main use of LFBs (as is evident from the fact that we can't even tell whether it is happening).

² The buffers that hold stores before and after retirement might be two entirely different structures, with different sizes and behaviors, but here we'll refer to them as one structure.

³ The described scenario involves the store that misses waiting at the head of the store buffer until the associated line returns. An alternate scenario is that the store data is written into the LFB used for the request, and the store buffer entry can be freed. This potentially allows some subsequent stores to be processed while the miss is in progress, subject to the strict x86 ordering requirements. This could increase store memory-level parallelism (MLP).

answered Nov 08 '22 20:11

BeeOnRope

