Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

What is the floating-point (__m256d) version of the non-temporal streaming load intrinsic (_mm256_stream_load_si256)?

In AVX/AVX2 I could only find _mm256_stream_load_si256() , which is for __m256i. Is there no way to stream-load __m256d and why? (I would like to load it without polluting CPU cache)

Is there any obstacle for doing the following (aggressive casting)?

__m256d *pDest = /* ... */;
__m256d *pSrc = /* ... */;

/* ... */

const __m256i iWeight = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(pSrc));
const __m256d prior = _mm256_div_pd(*reinterpret_cast<const __m256d*>(&iWeight), divisor);
_mm256_stream_pd(reinterpret_cast<double*>(pDest), prior);
like image 292
Serge Rogatch Avatar asked Jul 04 '17 08:07

Serge Rogatch


People also ask

What is non-temporal hint?

The non-temporal hint is implemented by using a write combining (WC) memory type protocol when reading the data from memory. Using this protocol, the processor does not read the data into the cache hierarchy, nor does it fetch the corresponding cache line from memory into the cache hierarchy.

What is a non-temporal write?

Non-temporal in this context means the data will not be reused soon, so there is no reason to cache it. These non-temporal write operations do not read a cache line and then modify it; instead, the new content is directly written to memory.

What is a non-temporal store?

“Non-temporal store” means that the data being stored is not going to be read again soon (i.e., no “temporal locality”). So there is no benefit to keeping the data in the processor's cache(s), and there may be a penalty if the stored data displaces other useful data from the cache(s).


1 Answers

The _mm256_stream_load_si256() intrinsic corresponds to the (V)MOVNTDQA instruction. This is the only non-temporal load instruction, so this is the one you have to use, even when you are loading floating-point data.

(The other three non-temporal instructions only do stores: (V)MOVNTDQ (_mm256_stream_si256) is for double quadword integers, (V)MOVNTPS (_mm256_stream_ps) is for packed single-precision floating-point values, and (V)MOVNTPD (_mm256_stream_pd) is for packed double-precision floating-point values.)

The cast from __m256i* to __m256d*, and vice versa, is safe. These are just bits, and they're all stored in YMM registers. I've never seen a compiler that had trouble with these types of casts. Probably should check the resulting assembly code to be sure that it's not doing something funky, though!

The only time it would matter is on certain processors, where there is a domain-crossing penalty when you mix floating-point SIMD instructions with integer SIMD instructions. But since the only NT load is in the integer domain, you really have no choice here.

Note that all non-temporal instructions (loads and stores) require aligned addresses!

like image 116
Cody Gray Avatar answered Nov 15 '22 13:11

Cody Gray