If one has a 64 byte buffer that is heavily read/written to then it's likely that it'll be kept in L1; but is there any way to force that behaviour?
As in, give one core exclusive access to those 64 bytes and tell it not to sync the data with other cores nor the memory controller so that those 64 bytes always live in one core's L1 regardless of whether or not the CPU thinks it's used often enough.
No, x86 doesn't let you do this. You can force eviction with clflushopt, or (on upcoming CPUs) force just a write-back without eviction with clwb, but you can't pin a line in cache or disable coherency.
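To make the "force evict" direction concrete, here is a minimal sketch in C. It uses the older clflush instruction through the `_mm_clflush` intrinsic, because that is baseline on x86-64 and needs no extra compiler flags; clflushopt and clwb have analogous intrinsics (`_mm_clflushopt`, `_mm_clwb` in `<immintrin.h>`) but need `-mclflushopt` / `-mclwb` and CPU support. The helper name `flush_range` and the 64-byte line size are assumptions for illustration, and on non-x86-64 targets the function is compiled as a no-op.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__x86_64__) || defined(_M_X64)
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence (SSE2, baseline on x86-64) */
#define HAVE_CLFLUSH 1
#else
#define HAVE_CLFLUSH 0
#endif

/* Hypothetical helper: evict every cache line overlapping [p, p+len)
 * from the whole cache hierarchy. Dirty data is written back first,
 * so the contents of memory are unchanged; only the caching state is.
 * Assumes 64-byte cache lines. */
static void flush_range(const void *p, size_t len)
{
#if HAVE_CLFLUSH
    const uintptr_t line = 64;                  /* assumed line size */
    uintptr_t a   = (uintptr_t)p & ~(line - 1); /* round down to a line start */
    uintptr_t end = (uintptr_t)p + len;
    for (; a < end; a += line)
        _mm_clflush((const void *)a);
    _mm_mfence();  /* order the flushes before later loads and stores */
#else
    (void)p;
    (void)len;     /* no-op on targets without clflush */
#endif
}
```

Note that this is the opposite of what the question asks for: there is a way to push a line out, but no matching instruction to hold one in.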
You can put the whole CPU (or a single core?) into cache-as-RAM (aka no-fill) mode to disable sync with the memory controller, and disable ever writing back the data (see "Cache-as-Ram (no fill mode) Executable Code"). It's typically used by BIOS / firmware in early boot before configuring the memory controllers. It's not available on a per-line basis, and is almost certainly not practically useful here. Fun fact: leaving this mode is one of the use-cases for invd, which drops cached data without writeback, as opposed to wbinvd.
I'm not sure if no-fill mode prevents eviction from L1d to L3 or whatever; or if data is just dropped on eviction. So you'd just have to avoid accessing more than 7 other cache lines that alias the one you care about in your L1d, or the equivalent for L2/L3.
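As a rough illustration of what "alias" means here, the sketch below computes which L1d set an address maps to. It assumes a 32 KiB, 8-way L1d with 64-byte lines (64 sets), which matches many Intel cores but not all CPUs; the names `l1d_set_index` and `lines_alias` are made up for this example.

```c
#include <stdint.h>

/* Assumed geometry: 32 KiB L1d = 64 sets * 8 ways * 64-byte lines. */
enum { LINE_BYTES = 64, L1D_SETS = 64, L1D_WAYS = 8 };

/* Set index = address bits [11:6] under the assumed geometry. */
static unsigned l1d_set_index(uintptr_t addr)
{
    return (unsigned)((addr / LINE_BYTES) % L1D_SETS);
}

/* True if a and b are in different cache lines that compete for the
 * same L1d set (i.e. they differ by a multiple of 4096 bytes here). */
static int lines_alias(uintptr_t a, uintptr_t b)
{
    return (a / LINE_BYTES) != (b / LINE_BYTES)
        && l1d_set_index(a) == l1d_set_index(b);
}
```

With this geometry the index bits sit inside the 4 KiB page offset, so the same calculation holds for virtual and physical addresses. A set holds `L1D_WAYS` lines, so the line you care about can coexist with at most 7 aliasing lines.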
Being able to force one core to hang on to a line of L1d indefinitely and not respond to MESI requests to write it back / share it would make the other cores vulnerable to lockups if they ever touched that line. So obviously if such a feature existed, it would require kernel mode. (And with HW virtualization, require hypervisor privilege.) It could also block hardware DMA (because modern x86 has cache-coherent DMA).
So supporting such a feature would require lots of parts of the CPU to handle indefinite delays, where currently there's probably some upper bound, which may be shorter than a PCIe timeout, if there is such a thing. (I don't write drivers or build real hardware, just guessing about this).
As @fuz points out, a coherency-violating instruction (xdcbt) was tried on PowerPC (in the Xbox 360 CPU), with disastrous results from mis-speculated execution of the instruction. So it's hard to implement.
If the line is frequently used, LRU replacement will keep it hot. And if it's lost from L1d at frequent enough intervals, then it will probably stay hot in L2 which is also on-core and private, and very fast, in recent designs (Intel since Nehalem). Intel's inclusive L3 on CPUs other than Skylake-AVX512 means that staying in L1d also means staying in L3.
All this means that full cache misses all the way to DRAM are very unlikely with any kind of frequency for a line that's heavily used by one core. So throughput shouldn't be a problem. I guess you could maybe want this for realtime latency, where the worst-case run time for one call of a function mattered. Dummy reads from the cache line in some other part of the code could be helpful in keeping it hot.
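A dummy read could look something like the sketch below. The `struct hot_line` type and `touch_line` function are hypothetical names; the only real ingredients are 64-byte alignment (so the buffer occupies exactly one line, assuming 64-byte lines) and a volatile load that the compiler can't optimize away.

```c
/* Hypothetical 64-byte buffer that occupies exactly one cache line,
 * assuming 64-byte lines. _Alignas is standard C11. */
struct hot_line {
    _Alignas(64) unsigned char bytes[64];
};

/* Dummy read: the volatile access forces a real load, which counts as
 * a use for the cache's replacement policy. The value is returned only
 * so callers can ignore or check it. */
static inline unsigned char touch_line(const struct hot_line *l)
{
    return *(const volatile unsigned char *)l->bytes;
}
```

Calling `touch_line(&buf)` from other code paths that run between the latency-critical uses only makes eviction less likely; it is not a guarantee that the line stays in L1d.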
However, if pressure from other cores in L3 cache causes eviction of this line from L3, Intel CPUs with an inclusive L3 also have to force eviction from inner caches that still have it hot. IDK if there's any mechanism to let L3 know that a line is heavily used in a core's L1d, because that doesn't generate any L3 traffic.
I'm not aware of this being much of a problem in real code. L3 is highly associative (like 16 or 24 way), so it takes a lot of conflicts before you'd get an eviction. L3 also uses a more complex indexing function (like a real hash function, not just modulo by taking a contiguous range of bits). In IvyBridge and later, it also uses an adaptive replacement policy to mitigate eviction from touching a lot of data that won't be reused often: see http://blog.stuffedcow.net/2013/01/ivb-cache-replacement/.
See also "Which cache mapping technique is used in intel core i7 processor?"
@AlexisWilke points out that you could maybe use vector register(s) instead of a line of cache, for some use-cases (see "Using ymm registers as a "memory-like" storage location"). You could globally dedicate some vector regs to this purpose. To get this in gcc-generated code, maybe use -ffixed-ymm8, or declare it as a volatile global register variable. (See "How to inform GCC to not use a particular register".)
Using ALU instructions or store-forwarding to get data to/from the vector reg will give you guaranteed latency with no possibility of data-cache misses. But code-cache misses are still a problem for extremely low latency.