In AVX/AVX2 I could only find _mm256_stream_load_si256()
, which is for __m256i
. Is there no way to stream-load __m256d
and why? (I would like to load it without polluting CPU cache)
Is there any obstacle for doing the following (aggressive casting)?
__m256d *pDest = /* ... */;
__m256d *pSrc = /* ... */;
/* ... */
const __m256i iWeight = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(pSrc));
const __m256d prior = _mm256_div_pd(*reinterpret_cast<const __m256d*>(&iWeight), divisor);
_mm256_stream_pd(reinterpret_cast<double*>(pDest), prior);
The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.
Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.
“Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”). So there is no benefit to keeping the data in the processor's cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).
The _mm256_stream_load_si256()
intrinsic corresponds to the (V)MOVNTDQA
instruction. This is the only non-temporal load instruction, so this is the one you have to use, even when you are loading floating-point data.
(The other three non-temporal instructions only do stores: (V)MOVNTDQ
(_mm256_stream_si256
) is for double quadword integers, (V)MOVNTPS
(_mm256_stream_ps
) is for packed single-precision floating-point values, and (V)MOVNTPD
(_mm256_stream_pd
) is for packed double-precision floating-point values.)
The cast from __m256i*
to __m256d*
, and vice versa, is safe. These are just bits, and they're all stored in YMM
registers. I've never seen a compiler that had trouble with these types of casts. Probably should check the resulting assembly code to be sure that it's not doing something funky, though!
The only time it would matter is on certain processors, where there is a domain-crossing penalty when you mix floating-point SIMD instructions with integer SIMD instructions. But since the only NT load is in the integer domain, you really have no choice here.
Note that all non-temporal instructions (loads and stores) require aligned addresses!
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With