Background: I've implemented a stochastic algorithm that requires random ordering for best convergence. Doing so obviously destroys memory locality, however. I've found that by prefetching the next iteration's data, the performance drop is minimized.
I can prefetch n cache lines using _mm_prefetch in a simple, mostly OS- and compiler-portable fashion - but what's the length of a cache line? Right now I'm using a hardcoded value of 64, which seems to be the norm on x64 processors nowadays - but I don't know how to detect this at runtime, and a question about this last year found no simple solution.
I've seen GetLogicalProcessorInformation on Windows, but I'm leery of using such a complex API for something so simple, and it won't work on Macs or Linux anyhow.
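(For reference, a rough, untested sketch of what per-platform runtime detection could look like; the glibc sysconf name and the Darwin sysctl key are platform-specific extensions, and the fallback of 64 is an assumption:)

    #include <stddef.h>

    #if defined(_WIN32)
    #include <windows.h>
    #include <stdlib.h>
    static size_t cache_line_size(void)
    {
        size_t line = 64;                               /* assumed fallback */
        DWORD len = 0;
        GetLogicalProcessorInformation(NULL, &len);     /* query buffer size */
        SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);
        if (info && GetLogicalProcessorInformation(info, &len)) {
            for (DWORD i = 0; i < len / sizeof(*info); ++i)
                if (info[i].Relationship == RelationCache &&
                    info[i].Cache.Level == 1 &&
                    info[i].Cache.Type != CacheInstruction)
                    line = info[i].Cache.LineSize;      /* L1 data/unified line */
        }
        free(info);
        return line;
    }
    #elif defined(__APPLE__)
    #include <sys/sysctl.h>
    static size_t cache_line_size(void)
    {
        size_t line = 0, len = sizeof(line);
        if (sysctlbyname("hw.cachelinesize", &line, &len, NULL, 0) == 0 && line != 0)
            return line;
        return 64;                                      /* assumed fallback */
    }
    #else   /* assume Linux/glibc */
    #include <unistd.h>
    static size_t cache_line_size(void)
    {
        long line = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);  /* glibc extension */
        return line > 0 ? (size_t)line : 64;              /* assumed fallback */
    }
    #endif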
Perhaps there's some other API/intrinsic entirely that prefetches a memory region specified in bytes (or words, or whatever), and would let me prefetch without knowing the cache line length?
Basically, is there a reasonable alternative to _mm_prefetch with #define CACHE_LINE_LEN 64?
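(To make the pattern concrete, here is a minimal sketch of prefetching the next iteration's data with the hard-coded line length; ELEM_SIZE, order, data, and process() are hypothetical placeholders for the real algorithm's structures:)

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */
    #include <stddef.h>

    #define CACHE_LINE_LEN 64     /* the hard-coded value from the question */
    #define ELEM_SIZE      256    /* hypothetical size of one iteration's record */

    /* Hypothetical stand-in for the real per-iteration update. */
    static void process(unsigned char *elem) { elem[0] += 1; }

    /* While the current (randomly ordered) element is processed, prefetch
       every cache line of the element the next iteration will touch. */
    static void run_pass(unsigned char *data, const size_t *order, size_t n)
    {
        for (size_t i = 0; i < n; ++i) {
            if (i + 1 < n) {
                const char *next = (const char *)data + order[i + 1] * ELEM_SIZE;
                for (size_t off = 0; off < ELEM_SIZE; off += CACHE_LINE_LEN)
                    _mm_prefetch(next + off, _MM_HINT_T0);
            }
            process(data + order[i] * ELEM_SIZE);
        }
    }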
Prefetching with low predictive accuracy can improve performance only in over-provisioned systems. The data cache, however, is inherently under-provisioned, since it can hold only a subset of the data set, and prefetched data typically shares that cache space with demand-fetched data.
Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
In computer architecture, prefetching refers to retrieving data and storing it in the buffer memory (cache) before the processor requires it. When the processor then needs the data, it is already available and can be accessed within a very short period of time.
A hardware prefetcher is a data prefetching technique implemented as a hardware component in the processor. Any other prefetching technique, such as software prefetching via explicit prefetch instructions like _mm_prefetch, is a non-hardware prefetcher.
There's a question asking just about the same thing here. You can read the cache line size from CPUID if you feel like delving into some assembly (or compiler intrinsics). You'll have to write platform-specific code for this, of course.
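(On GCC/Clang this doesn't even need hand-written assembly; a sketch using the <cpuid.h> intrinsic, where leaf 1 reports the CLFLUSH line size in 8-byte units, which on current x86 parts matches the cache line size, and the 64-byte fallback is an assumption:)

    #include <cpuid.h>
    #include <stddef.h>

    /* Derive the cache line size from the CLFLUSH line size reported by
       CPUID leaf 1 (EBX bits 15:8, in 8-byte units). MSVC would use
       __cpuid from <intrin.h> instead. */
    static size_t cache_line_size_cpuid(void)
    {
        unsigned int eax, ebx, ecx, edx;
        if (__get_cpuid(1, &eax, &ebx, &ecx, &edx))
            return ((ebx >> 8) & 0xff) * 8;   /* CLFLUSH line size */
        return 64;                            /* assumed fallback */
    }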
You're probably already familiar with Agner Fog's optimization manuals, which give cache information for many popular processors. If you can determine which CPUs you expect to encounter, you can hard-code their cache line sizes and look up the CPU vendor information at runtime to select the right one.
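(A sketch of that idea: the vendor string comes from CPUID leaf 0, assembled from EBX, EDX, ECX in that order; the sizes in the table are placeholders you'd fill in from Agner Fog's tables for the CPUs you actually target:)

    #include <cpuid.h>
    #include <string.h>

    static unsigned line_size_by_vendor(void)
    {
        unsigned int eax, ebx, ecx, edx;
        char vendor[13] = {0};
        if (!__get_cpuid(0, &eax, &ebx, &ecx, &edx))
            return 64;                          /* assumed fallback */
        memcpy(vendor + 0, &ebx, 4);
        memcpy(vendor + 4, &edx, 4);
        memcpy(vendor + 8, &ecx, 4);
        /* Placeholder table: replace with values for your target CPUs. */
        if (strcmp(vendor, "GenuineIntel") == 0) return 64;
        if (strcmp(vendor, "AuthenticAMD") == 0) return 64;
        return 64;                              /* conservative default */
    }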