 

Non-temporal loads and the hardware prefetcher, do they work together?

When executing a series of _mm_stream_load_si128() calls (MOVNTDQA) from consecutive memory locations, will the hardware pre-fetcher still kick in, or should I use explicit software prefetching (with the NTA hint) to obtain the benefits of prefetching while still avoiding cache pollution?

The reason I ask this is because their objectives seem contradictory to me. A streaming load will fetch data bypassing the cache, while the pre-fetcher attempts to proactively fetch data into the cache.

When sequentially iterating a large data structure (processed data won't be retouched in a long while), it would make sense to me to avoid polluting the cache hierarchy, but I do not want to incur frequent ~100-cycle penalties because the pre-fetcher is idle.

Target architecture is Intel SandyBridge.
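
For concreteness, here is a minimal sketch of the pattern being asked about (C with SSE4.1 intrinsics; process_stream and the PF_DIST distance are made-up names/values for illustration, not recommendations):

    #include <immintrin.h>
    #include <stddef.h>

    /* Hypothetical prefetch distance in bytes; would need per-machine tuning. */
    #define PF_DIST 512

    void process_stream(const __m128i *src, size_t n_vecs)
    {
        for (size_t i = 0; i < n_vecs; i++) {
            if (i % 4 == 0)  /* one prefetch per 64-byte cache line */
                _mm_prefetch((const char *)(src + i) + PF_DIST, _MM_HINT_NTA);
            __m128i v = _mm_stream_load_si128((__m128i *)(src + i));
            /* ... consume v immediately ... */
            (void)v;
        }
    }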

asked Aug 19 '15 by BlueStrat


3 Answers

I recently ran some tests of the various prefetch flavors while answering another question, and my findings were:

The results from using prefetchnta were consistent with the following implementation on Skylake client:

  • prefetchnta loads values into the L1 and L3 but not the L2 (in fact, it seems the line may be evicted from the L2 if it is already there).
  • It seems to load the value "normally" into L1, but in a weaker way in L3 such that it is evicted more quickly (e.g., only into a single way in the set, or with its LRU flag set such that it will be the next victim).
  • prefetchnta, like all other prefetch instructions, uses an LFB entry, so it doesn't really help you get additional parallelism; but the NTA hint can be useful here to avoid L2 and L3 pollution.

The current optimization manual (248966-038) claims in a few places that prefetchnta does bring data into the L2, but only in one way out of the set. E.g., in 7.6.2.1 Video Encoder:

The prefetching cache management implemented for the video encoder reduces the memory traffic. The second-level cache pollution reduction is ensured by preventing single-use video frame data from entering the second-level cache. Using a non-temporal PREFETCH (PREFETCHNTA) instruction brings data into only one way of the second-level cache, thus reducing pollution of the second-level cache.

This isn't consistent with my test results on Skylake, where striding over a 64 KiB region with prefetchnta shows performance almost exactly consistent with fetching data from the L3 (~4 cycles per load, with an MLP factor of 10 and an L3 latency of about 40 cycles):

                                 Cycles       ns
         64-KiB parallel loads     1.00     0.39
    64-KiB parallel prefetcht0     2.00     0.77
    64-KiB parallel prefetcht1     1.21     0.47
    64-KiB parallel prefetcht2     1.30     0.50
   64-KiB parallel prefetchnta     3.96     1.53

Since the L2 in Skylake is 4-way and 256 KiB, one way covers exactly 64 KiB, so data loaded into a single way should just barely stay in the L2 cache, but the results above indicate that it doesn't.

You can run these tests on your own hardware on Linux using my uarch-bench program. Results for older systems would be particularly interesting.
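
If you just want to eyeball the effect without the full harness, a rough sketch of such a test might look like the following (C, GCC/Clang intrinsics; nta_pass and the serial prefetch-only timing loop are my own crude simplifications, not the actual uarch-bench methodology, which uses proper warm-up, serialization, and parallel access patterns):

    #include <immintrin.h>
    #include <x86intrin.h>   /* __rdtsc (GCC/Clang) */
    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    enum { BUF_BYTES = 64 * 1024, LINE = 64 };

    /* Stride over the buffer with prefetchnta and time the pass. */
    static uint64_t nta_pass(const char *buf)
    {
        uint64_t t0 = __rdtsc();
        for (int off = 0; off < BUF_BYTES; off += LINE)
            _mm_prefetch(buf + off, _MM_HINT_NTA);
        return __rdtsc() - t0;
    }

    int main(void)
    {
        char *buf = aligned_alloc(64, BUF_BYTES);
        memset(buf, 1, BUF_BYTES);
        /* Discard pass 0 (cold). If prefetchnta left the lines in L2,
           steady-state passes would show L2-like times per line;
           L3-like times suggest it didn't. */
        for (int pass = 0; pass < 5; pass++)
            printf("pass %d: %.2f cycles/line\n",
                   pass, (double)nta_pass(buf) / (BUF_BYTES / LINE));
        free(buf);
        return 0;
    }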

Skylake Server (SKX)

The reported behavior of prefetchnta on Skylake Server, which has a different L3 cache architecture, is significantly different from Skylake client. In particular, user Mysticial reports that lines fetched using prefetchnta are not available in any cache level and must be re-read from DRAM once they are evicted from L1.

The most likely explanation is that they never entered the L3 at all as a result of the prefetchnta. This is plausible because on Skylake Server the L3 is a non-inclusive shared victim cache for the private L2 caches, so lines that bypass the L2 using prefetchnta never get a chance to enter the L3. This makes prefetchnta both purer in function (fewer cache levels are polluted) and more brittle: any failure to read an nta line from L1 before it is evicted means another full round trip to memory, and the initial request triggered by the prefetchnta is totally wasted.

answered by BeeOnRope


According to Patrick Fay (Intel)'s Nov 2011 post, "On recent Intel processors, prefetchnta brings a line from memory into the L1 data cache (and not into the other cache levels)." He also says you need to make sure you don't prefetch too late (the HW prefetcher will already have pulled it in to all levels), or too early (it will be evicted by the time you get there).


As discussed in comments on the OP, current Intel CPUs have a large shared L3 which is inclusive of all the per-core caches. This means cache-coherency traffic only has to check L3 tags to see if a cache line might be modified somewhere in a per-core L1/L2.

IDK how to reconcile Pat Fay's explanation with my understanding of cache coherency / the cache hierarchy. I thought if it does go in L1, it would also have to go in L3. Maybe the L1 tags have some kind of flag to say this line is weakly-ordered? My best guess is he was simplifying, and saying L1 when it actually only goes in fill buffers.

This Intel guide about working with video RAM talks about non-temporal moves using load/store buffers, rather than cache lines. (Note that this may only be the case for uncacheable memory.) It doesn't mention prefetch. It's also old, predating SandyBridge. However, it does have this juicy quote:

Ordinary load instructions pull data from USWC memory in units of the same size the instruction requests. By contrast, a streaming load instruction such as MOVNTDQA will commonly pull a full cache line of data to a special "fill buffer" in the CPU. Subsequent streaming loads would read from that fill buffer, incurring much less delay.

And then another paragraph says typical CPUs have 8 to 10 fill buffers. SnB/Haswell still have 10 per core. Again, note that this may only apply to uncacheable memory regions.

movntdqa on WB (write-back) memory is not weakly-ordered (see the NT loads section of the linked answer), so it's not allowed to be "stale". Unlike NT stores, neither movntdqa nor prefetchnta change the memory ordering semantics of Write-Back memory.

I have not tested this guess, but prefetchnta / movntdqa on a modern Intel CPU could load a cache line into L3 and L1, but could skip L2 (because L2 isn't inclusive or exclusive of L1). The NT hint could have an effect by placing the cache line in the LRU position of its set, where it's the next line to be evicted. (Normal cache policy inserts new lines at the MRU position, farthest from being evicted. See this article about IvB's adaptive L3 policy for more about cache insertion policy).


Prefetch throughput on IvyBridge is only one per 43 cycles, so be careful not to prefetch too much if you don't want prefetches to slow down your code on IvB. Source: Agner Fog's insn tables and microarch guide. This is a performance bug specific to IvB; on other designs, too much prefetch just takes up uop throughput that could have gone to useful instructions (on top of the harm of prefetching useless addresses).
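
In practice that means throttling how often you issue them, e.g. at most one prefetch per 64-byte line rather than one per element. A hedged sketch (sum_floats and the 512-byte distance are illustrative, not tuned values):

    #include <immintrin.h>
    #include <stddef.h>

    /* One software prefetch per 64-byte line (16 floats), not one per
       element: 16x fewer prefetch uops, which matters on IvB where
       prefetch throughput is ~1 per 43 cycles. */
    float sum_floats(const float *a, size_t n)  /* n assumed a multiple of 16 */
    {
        float sum = 0.0f;
        for (size_t i = 0; i < n; i += 16) {
            /* 512 bytes ahead; prefetching past the end of the array
               near the tail is harmless since prefetch hints don't fault. */
            _mm_prefetch((const char *)(a + i + 128), _MM_HINT_NTA);
            for (size_t j = 0; j < 16; j++)
                sum += a[i + j];
        }
        return sum;
    }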

About SW prefetching in general (not the nt kind): Linus Torvalds posted about how they rarely help in the Linux kernel, and often do more harm than good. Apparently prefetching a NULL pointer at the end of a linked-list can cause a slowdown, because it attempts a TLB fill.
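
As a toy illustration of that pitfall (using GCC/Clang's __builtin_prefetch; whether even the guarded version pays off at all is exactly what Linus questions):

    #include <stddef.h>

    struct node { struct node *next; int payload; };

    int walk(const struct node *p)
    {
        int sum = 0;
        while (p) {
            /* An unconditional __builtin_prefetch(p->next) would also
               prefetch NULL on the last node, potentially costing a TLB
               fill for nothing; the guard avoids that. */
            if (p->next)
                __builtin_prefetch(p->next);
            sum += p->payload;
            p = p->next;
        }
        return sum;
    }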

answered by Peter Cordes


Note: I wrote this answer when I was less knowledgeable, but I think it's still OK and useful.

Neither MOVNTDQA (on WC memory) nor PREFETCHNTA affects or triggers any of the cache hardware prefetchers. The whole idea of the non-temporal hint is to completely avoid cache pollution, or at least to minimize it as much as possible.

There is only a small (undocumented) number of buffers called streaming load buffers (these are separate from the line fill buffers and from the L1 cache) to hold cache lines fetched using MOVNTDQA, so basically you need to use what you fetch almost immediately. In addition, MOVNTDQA only works on WC memory on most Intel processors. On the GLC cores of Intel ADL, when MOVNTDQA is used on a memory location of type WB, a non-temporal protocol is used by default. The WB ordering semantics are still preserved, though, because the NT hint can never override the effective memory type in any case. This is not a breaking change and is consistent with the documentation.
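
That "use it almost immediately" advice usually translates into consuming a whole 64-byte line as a group of four 16-byte streaming loads before moving on, so the streaming load buffer holding that line is fully drained while it is still valid. A minimal sketch (copy_line_from_wc is a made-up name; assumes a WC-mapped, 64-byte-aligned source and SSE4.1):

    #include <immintrin.h>

    void copy_line_from_wc(const __m128i *wc_src, __m128i *dst)
    {
        /* Four back-to-back MOVNTDQA loads covering one 64-byte line;
           loads 2-4 should be served from the same streaming load buffer. */
        __m128i a = _mm_stream_load_si128((__m128i *)&wc_src[0]);
        __m128i b = _mm_stream_load_si128((__m128i *)&wc_src[1]);
        __m128i c = _mm_stream_load_si128((__m128i *)&wc_src[2]);
        __m128i d = _mm_stream_load_si128((__m128i *)&wc_src[3]);
        dst[0] = a; dst[1] = b; dst[2] = c; dst[3] = d;
    }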

The PREFETCHNTA instruction is perfect for your scenario, but you have to figure out how to use it properly in your code. From the Intel optimization manual Section 7.1:

If your algorithm is single-pass use PREFETCHNTA. If your algorithm is multi-pass use PREFETCHT0.

The PREFETCHNTA instruction offers the following benefits:

  • It fetches the particular cache line that contains the specified address into at least the L3 cache and/or potentially higher levels of the cache hierarchy (see Bee's and Peter's answers and Section 7.3.2). In every cache level that it gets cached in, it is likely to be treated as the first candidate for eviction when a line must be evicted from the set. In a single-pass algorithm (such as computing the average of a large array of numbers; see the sketch after this list) that is enhanced with PREFETCHNTA, later prefetched cache lines can be placed in the same way as earlier prefetched lines. So even if the total amount of data being fetched is massive, only one way of the whole cache will be affected; the data that resides in the other ways will remain cached and will be available after the algorithm terminates. But this is a double-edged sword: if two PREFETCHNTA instructions are too close to each other and the specified addresses map to the same cache set, then only one will survive.
  • Cache lines prefetched using PREFETCHNTA are kept coherent like any other cached lines using the same hardware coherence mechanism.
  • It works on the WB, WC, and WT memory types. Most probably your data is stored in WB memory.
  • Like I said before, it does not trigger hardware prefetching. This is also why it can be used to improve the performance of irregular memory access patterns, as recommended by Intel.
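
To make the single-pass case from the list above concrete, here is a hedged sketch of an average computation with PREFETCHNTA (average and PF_DIST are illustrative names/values; the distance would need tuning per machine):

    #include <immintrin.h>
    #include <stddef.h>

    #define PF_DIST 1024   /* hypothetical prefetch distance in bytes */

    double average(const double *a, size_t n)
    {
        double sum = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i % 8 == 0)   /* 8 doubles = one 64-byte line */
                _mm_prefetch((const char *)(a + i) + PF_DIST, _MM_HINT_NTA);
            sum += a[i];
        }
        return n ? sum / n : 0.0;
    }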

The thread that executes PREFETCHNTA may not be able to benefit from it effectively, depending on the behavior of any other threads running on the same physical core, on other physical cores of the same processor, or on cores of other processors that share the same coherence domain. Techniques such as pinning, priority boosting, CAT-based cache partitioning, and disabling hyperthreading may help that thread run efficiently. Note also that PREFETCHNTA is classified as a speculative load, and so it is concurrent with the three fence instructions.

answered by Hadi Brais