I'm writing code using the C intrinsics for Intel's AVX instructions. If I have a packed double vector (a __m256d
), what would be the most efficient way (i.e. the least number of operations) to store each of them to a different place in memory (i.e. I need to fan them out to different locations such that they are no longer packed)? Pseudocode:
__m256d *src;
double *dst;
int dst_dist;
dst[0] = src[0];
dst[dst_dist] = src[1];
dst[2 * dst_dist] = src[2];
dst[3 * dst_dist] = src[3];
Using SSE, I could do this with __m128
types using the _mm_storel_pi
and _mm_storeh_pi
intrinsics. I've not been able to find anything similar for AVX that allows me to store the individual 64-bit pieces to memory. Does one exist?
The latest Intel® Architecture Instruction Set Extensions Programming Reference includes the definition of Intel® Advanced Vector Extensions 512 (Intel® AVX-512) instructions. These instructions represent a significant leap to 512-bit SIMD support.
Roughly, for Intel AVX, any multiple of 32-bit or 64-bit floating-point type that adds to 128 or 256 bits is allowed as well as multiples of any integer type that adds to 128 bits. Figure 2. Intel® AVX and Intel® SSE data types
Intel® Advanced Vector Extensions (Intel® AVX) is a set of instructions for doing Single Instruction Multiple Data (SIMD) operations on Intel® architecture CPUs. These instructions extend previous SIMD offerings (MMX™ instructions and Intel® Streaming SIMD Extensions (Intel® SSE)) by adding the following new features:
Intel® AVX Intrinsics Data Types Type Meaning __m256 256 -bit as eight single precision floating point values, representing a YMM register or memory location __m256d 256-bit as four double precision floating point values, representing a YMM register or memory location __m256i 256-bit as integers, (bytes, words, etc.)
You can do it with a couple of extract instrinsics: (warning: untested)
__m256d src = ... // data
__m128d a = _mm256_extractf128_pd(src, 0);
__m128d b = _mm256_extractf128_pd(src, 1);
_mm_storel_pd(dst + 0*dst_dist, a);
_mm_storeh_pd(dst + 1*dst_dist, a);
_mm_storel_pd(dst + 2*dst_dist, b);
_mm_storeh_pd(dst + 3*dst_dist, b);
What you want is the gather/scatter instructions in AVX2... But that's still a few years down the road.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With