What is the floating-point (__m256d) version of the non-temporal streaming load intrinsic (_mm256_stream_load_si256)?

Tags: c++, x86, simd, intrinsics, avx2

In AVX/AVX2 I could only find _mm256_stream_load_si256(), which is for __m256i. Is there no way to stream-load a __m256d, and if not, why? (I would like to load it without polluting the CPU cache.)

Is there any obstacle to doing the following (aggressive casting)?

__m256d *pDest = /* ... */;
__m256d *pSrc = /* ... */;

/* ... */

const __m256i iWeight = _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(pSrc));
const __m256d prior = _mm256_div_pd(*reinterpret_cast<const __m256d*>(&iWeight), divisor);
_mm256_stream_pd(reinterpret_cast<double*>(pDest), prior);

292

asked Jul 04 '17 08:07

Serge Rogatch

1 Answer

The _mm256_stream_load_si256() intrinsic corresponds to the (V)MOVNTDQA instruction. This is the only non-temporal load instruction, so this is the one you have to use, even when you are loading floating-point data.

(The other three non-temporal instructions that have 256-bit YMM forms only do stores: (V)MOVNTDQ (_mm256_stream_si256) is for double quadword integers, (V)MOVNTPS (_mm256_stream_ps) is for packed single-precision floating-point values, and (V)MOVNTPD (_mm256_stream_pd) is for packed double-precision floating-point values.)

The cast from __m256i* to __m256d*, and vice versa, is safe. These are just bits, and they're all stored in YMM registers. I've never seen a compiler that had trouble with these types of casts. You should probably still check the resulting assembly code to be sure that it's not doing something funky, though!

The only time the choice of load would matter is on processors that impose a domain-crossing penalty when you mix floating-point SIMD instructions with integer SIMD instructions. But since the only NT load is in the integer domain, you really have no choice here.

Note that all of these non-temporal instructions (loads and stores) require aligned addresses: 32-byte alignment for the 256-bit forms!
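As a rough sketch of how this could fit together, the snippet below wraps the integer NT load in a small helper and reinterprets the result with the _mm256_castsi256_pd intrinsic, which only changes the type and generates no instruction. The names stream_load_pd, divide_stream and SKETCH_TARGET_AVX2 are made up for illustration and are not from the question. It assumes both pointers are 32-byte aligned and that n is a multiple of 4. The target attribute is a GCC/Clang convenience; with other compilers you would enable AVX2 through the build flags instead.

```cpp
#include <immintrin.h>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical macro: lets GCC/Clang compile these functions for AVX2
// without a global -mavx2 flag. Other compilers need AVX2 enabled globally.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_TARGET_AVX2 __attribute__((target("avx2")))
#else
#define SKETCH_TARGET_AVX2
#endif

// Hypothetical helper: non-temporal load of four doubles.
// Assumption: p is 32-byte aligned (VMOVNTDQA faults otherwise).
SKETCH_TARGET_AVX2 static inline __m256d stream_load_pd(const double* p) {
    assert(reinterpret_cast<std::uintptr_t>(p) % 32 == 0);
    const __m256i bits =
        _mm256_stream_load_si256(reinterpret_cast<const __m256i*>(p));
    // Reinterpret the 256 bits as packed doubles; no conversion takes place.
    return _mm256_castsi256_pd(bits);
}

// Hypothetical example loop: dst[i] = src[i] / d, with NT loads and NT stores.
// Assumptions: dst and src are 32-byte aligned, n is a multiple of 4.
SKETCH_TARGET_AVX2 void divide_stream(double* dst, const double* src,
                                      std::size_t n, double d) {
    const __m256d divisor = _mm256_set1_pd(d);
    for (std::size_t i = 0; i < n; i += 4) {
        const __m256d v = stream_load_pd(src + i);
        _mm256_stream_pd(dst + i, _mm256_div_pd(v, divisor));
    }
    // Make the weakly ordered NT stores globally visible before returning.
    _mm_sfence();
}
```

Using the cast intrinsic on the loaded value avoids taking the address of the __m256i variable and dereferencing it through a differently typed pointer. One caveat worth measuring for yourself: Intel's documentation describes the non-temporal hint of (V)MOVNTDQA as implementation-dependent, and a processor may treat the instruction as an ordinary aligned load on memory that is not write-combining, so the cache behaviour is not guaranteed.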

116

answered Nov 15 '22 13:11

Cody Gray

