 

Force a migration of a cache line to another core

In C++ (using any of the low level intrinsics available on the platform) for x86 hardware (say Intel Skylake for example), is it possible to send a cacheline to another core without forcing the thread on that core to load the line explicitly?

My use case is a concurrent data structure. In it, a core sometimes walks through memory locations that may be owned by other core(s) while probing for spots. The threads on those cores are typically blocked on a condition variable, so they have spare cycles in which they could do additional "useful work". One example of "useful work" here would be streaming the data to the core that will load it in the future, so the loading core doesn't have to wait for the line to arrive in its cache before processing it. Is there some intrinsic/instruction available on x86 hardware that makes this possible?


A __builtin_prefetch didn't work well here: for some reason it ends up adding that latency back to the code doing the loading :( Maybe the strides were not well configured; I haven't been able to find good strides so far. This might be handled better, and deterministically, by the other cores that know their lines might be loaded eventually.

asked Jan 27 '23 by Curious

1 Answer

There is no "push"; a cache line enters L1d on a physical core only after that core requests it. (Because of a load, SW prefetch, or even HW prefetch.)

Two logical cores can share the same physical core, in case that helps: it might be less horrible to wake up a prefetch-assistant thread to prime the cache if the latency of some future load matters far more than throughput. I'm picturing the writer using a condition variable, a POSIX signal, a write to a pipe, or anything else that results in an OS-assisted wakeup of another thread whose CPU affinity is set to one or both of the logical cores that the thread you care about is pinned to.


The best you can possibly do from the writer's side is trigger write-back to shared (L3) cache, so the other core can hit in L3 instead of finding the line owned by another core and having to wait for that write-back too. (Or, depending on the uarch, wait for a direct core->core transfer.)

e.g. on Ice Lake or later, use clwb to force a write-back, leaving the line clean but still cached. (But note that it still forces the data to go all the way to DRAM.) On SKX, clwb does evict the line, like clflushopt.

See also CPU cache inhibition where I suggested possibly using a memory region set to write-through caching, if that's possible under a mainstream OS. See also How to force cpu core to flush store buffer in c?

Or of course pin both writer and reader to the same physical core so they communicate via L1d. But then they compete for execution resources.

answered Jan 31 '23 by Peter Cordes