I did a test with this:
    for (i32 i = 0; i < 0x800000; ++i)
    {
        // Hopefully this can disable hardware prefetch
        i32 k = (i * 997 & 0x7FFFFF) * 0x40;
        _mm_prefetch(data + ((i + 1) * 997 & 0x7FFFFF) * 0x40, _MM_HINT_NTA);
        for (i32 j = 0; j < 0x40; j += 0x10)
        {
            //__m128 v = _mm_castsi128_ps(_mm_stream_load_si128((__m128i *)(data + k + j)));
            __m128 v = _mm_load_ps((float *)(data + k + j));
            a_single_chain_computation
            //_mm_stream_ps((float *)(data2 + k + j), v);
            _mm_store_ps((float *)(data2 + k + j), v);
        }
    }
Results are weird.
1. No matter how long a_single_chain_computation takes, the load latency is not hidden.
2. With a single v = _mm_mul_ps(v, v), prefetching saves about 0.60 - 0.57 = 0.03s. And with 16 v = _mm_mul_ps(v, v), it saves about 1.1 - 0.75 = 0.35s. Why?

If the prefetched data is not subsequently used by the data consumer, the extra cost of prefetching normally reduces performance. Only in over-provisioned systems can prefetching with low predictive accuracy improve performance.
You want to prefetch once per 64B cache line, and you'll need to tune how far ahead to prefetch. For example, _mm_prefetch((char*)(A+64), _MM_HINT_NTA); and the same for B would prefetch 16*64 = 1024 bytes ahead of where you're loading, which hides some of the latency of a cache miss while still easily fitting in L1D.
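A minimal sketch of that advice, for a plain sequential sum rather than the A/B loop mentioned above. The function name sum_with_prefetch, the PREFETCH_NTA macro, and the ahead_lines parameter are made up for illustration; the non-x86 fallback to __builtin_prefetch is an assumption about a GCC/Clang toolchain. It issues one prefetch per 64-byte line, a tunable number of lines ahead:

```c
#include <assert.h>
#include <stddef.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#elif defined(__GNUC__)
/* Assumed fallback for non-x86 GCC/Clang builds: locality 0 is the closest analogue. */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)0)
#endif

#define FLOATS_PER_LINE (64 / sizeof(float)) /* 16 floats per 64-byte line */

/* Sums n floats, issuing one prefetch per cache line, ahead_lines lines ahead.
   Hypothetical helper, not code from the thread. */
float sum_with_prefetch(const float *a, size_t n, size_t ahead_lines)
{
    assert(a != NULL || n == 0);
    float sum = 0.0f;
    size_t ahead = ahead_lines * FLOATS_PER_LINE;
    for (size_t i = 0; i < n; i += FLOATS_PER_LINE) {
        /* One prefetch per line; skip it once the target would be past the end. */
        if (i + ahead < n)
            PREFETCH_NTA(a + i + ahead);
        size_t end = (i + FLOATS_PER_LINE < n) ? i + FLOATS_PER_LINE : n;
        for (size_t j = i; j < end; ++j)
            sum += a[j];
    }
    return sum;
}
```

With ahead_lines = 16 the prefetch runs 16*64 = 1024 bytes ahead, matching the distance quoted above. For a purely sequential walk like this one the hardware prefetcher may already do the job, so treat the distance as something to measure, not a constant to copy.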
Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
You need to separate two different things here (which unfortunately have similar names):
Non-temporal prefetching - This would prefetch the line, but write it as the least recently used one when it fills the caches, so it would be the first in line for eviction when you next use the same set. That leaves you enough time to actually use it (unless you're very unlucky), but wouldn't waste more than a single way out of that set, since the next prefetch to come along would just replace it. By the way, regarding your comments above - every prefetch would pollute the L3 cache, it's inclusive so you can't get away without it.
Non-temporal (streaming) loads/stores - these also won't pollute the caches, but they use a completely different mechanism: the accesses are made uncacheable (and write-combining). This would indeed have a penalty on performance, even if you really don't need these lines ever again, since a cacheable write has the luxury of staying buffered in the cache until evicted, so you don't have to write it out right away. With uncacheable writes you do, and in some scenarios that might interfere with your memory bandwidth. On the other hand, you get the benefit of write-combining and weak ordering, which may give you an edge in several cases. The bottom line here is that you should use it only when it helps; don't assume it magically improves performance (nothing does that nowadays...).
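To make the second case concrete, here is a rough sketch of a copy loop that reads normally and writes with streaming stores. The function name copy_streaming and the memcpy fallback for non-SSE builds are assumptions for illustration, not code from the question:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Copies n floats from src to dst using non-temporal stores where SSE is available.
   Hypothetical helper: n must be a multiple of 4 and both pointers 16-byte aligned. */
void copy_streaming(float *dst, const float *src, size_t n)
{
    assert(n % 4 == 0);
    assert((uintptr_t)dst % 16 == 0 && (uintptr_t)src % 16 == 0);
#if defined(__SSE__)
    for (size_t i = 0; i < n; i += 4) {
        __m128 v = _mm_load_ps(src + i); /* ordinary cacheable load */
        _mm_stream_ps(dst + i, v);       /* non-temporal store, goes out via write-combining buffers */
    }
    /* Streaming stores are weakly ordered, so fence before anything else relies on dst. */
    _mm_sfence();
#else
    /* Assumed portable fallback: ordinary cached stores. */
    memcpy(dst, src, n * sizeof(float));
#endif
}
```

Whether this beats _mm_store_ps depends on whether dst is touched again soon and on the specific CPU, which is the point made above: measure both variants before committing to one.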
Regarding your questions:
1. Your prefetching should work, but it's not early enough to make an impact. Try replacing i+1 with a larger number. Actually, maybe even do a sweep; it would be interesting to see how many elements in advance you should peek.
2. I'd guess this is the same as 1 - with 16 muls your iteration is long enough for the prefetch to work.
3. As I said - your stores won't have the benefit of buffering in the lower-level caches, and would have to get flushed to memory. That's the downside of streaming stores. It's implementation-specific of course, so it might improve, but at the moment it's not always effective.
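The sweep suggested in the first point could look roughly like this. It reuses the scattered access pattern from the question (line index = i * 997 masked to the buffer size) but replaces the fixed i+1 with a variable distance. The names scattered_sum, sweep_prefetch_distance, and PREFETCH_NTA are invented for this sketch, the byte checksum stands in for the real computation, and unsigned arithmetic is used so that i * 997 wraps in a well-defined way:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <time.h>

#if defined(__SSE__)
#include <xmmintrin.h>
#define PREFETCH_NTA(p) _mm_prefetch((const char *)(p), _MM_HINT_NTA)
#elif defined(__GNUC__)
/* Assumed fallback for non-x86 GCC/Clang builds. */
#define PREFETCH_NTA(p) __builtin_prefetch((p), 0, 0)
#else
#define PREFETCH_NTA(p) ((void)0)
#endif

/* Visits `lines` 64-byte lines in scattered order, prefetching the line that
   will be needed `dist` iterations later. `lines` must be a power of two.
   Returns a byte checksum so the loads cannot be optimized away. */
uint32_t scattered_sum(const unsigned char *data, uint32_t lines, uint32_t dist)
{
    assert(lines != 0 && (lines & (lines - 1)) == 0);
    uint32_t mask = lines - 1;
    uint32_t sum = 0;
    for (uint32_t i = 0; i < lines; ++i) {
        size_t k = (size_t)((i * 997u) & mask) * 64;
        /* The mask keeps the prefetch target inside the buffer even near the end. */
        PREFETCH_NTA(data + (size_t)(((i + dist) * 997u) & mask) * 64);
        for (size_t j = 0; j < 64; ++j)
            sum += data[k + j];
    }
    return sum;
}

/* Times scattered_sum for distances 1, 2, 4, ... up to max_dist and prints each result. */
void sweep_prefetch_distance(const unsigned char *data, uint32_t lines, uint32_t max_dist)
{
    for (uint32_t dist = 1; dist != 0 && dist <= max_dist; dist *= 2) {
        clock_t t0 = clock();
        uint32_t sum = scattered_sum(data, lines, dist);
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
        printf("dist=%u  time=%.3fs  checksum=%u\n", (unsigned)dist, secs, (unsigned)sum);
    }
}
```

To get meaningful timings the buffer has to be much larger than the last-level cache (the question uses 0x800000 lines, i.e. 512 MB), and each distance should be run several times. The best distance will likely shift as the per-iteration computation gets longer or shorter.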
If your computation chain is very short and if you're reading memory sequentially then the CPU will prefetch well on its own and actually work faster since its decoder has less work to do.
Streaming loads and stores are good only if you don't plan to access this memory in the near future. They are mainly aimed at uncacheable write-combining (WC) memory, which is what you usually find when dealing with graphics surfaces. Explicit prefetching may work well on one architecture (CPU model) and have a negative effect on other models, so use it as a last-resort option when optimizing.