How can I clear the upper 128 bits of m2:
__m256i m2 = _mm256_set1_epi32(2);
__m128i m1 = _mm_set1_epi32(1);
m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);
These don't work; Intel’s documentation for the _mm256_castsi128_si256 intrinsic says that “the upper bits of the resulting vector are undefined”.
At the same time I can easily do it in assembly:
VMOVDQA xmm2, xmm2 // zeros upper half of ymm2
VMOVDQA xmm2, xmm1 // copies xmm1 and zeros upper half of ymm2
Of course, I'd rather not use a bitwise AND or _mm256_insertf128_si256() and the like.
A new intrinsic function has been added for solving this problem:
m2 = _mm256_zextsi128_si256(m1);
https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_zextsi128_si256&expand=6177,6177
This intrinsic doesn't produce any code if the upper half is already known to be zero; it just makes sure the upper half is not treated as undefined.
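For illustration, here is a minimal sketch (assuming a compiler recent enough to provide _mm256_zextsi128_si256) contrasting the cast intrinsic, whose upper bits are undefined, with the zero-extend intrinsic; the function names are made up for the example:
#include <immintrin.h>
// Upper 128 bits of the result are undefined: the compiler may leave garbage there.
__m256i widen_undef(__m128i lo) {
    return _mm256_castsi128_si256(lo);
}
// Upper 128 bits of the result are guaranteed zero. This usually costs no extra
// instruction, because VEX-encoded 128-bit operations already zero the upper lane.
__m256i widen_zext(__m128i lo) {
    return _mm256_zextsi128_si256(lo);
}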
Update: there's now a __m256i _mm256_zextsi128_si256(__m128i) intrinsic; see Agner Fog's answer. The rest of the answer below is only relevant for old compilers that don't support this intrinsic, and where there's no efficient, portable solution.
Unfortunately, the ideal solution will depend on which compiler you are using, and on some of them, there is no ideal solution.
There are several basic ways that we could write this:
Version A:
ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
Version B:
ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                         ymm,
                         _MM_SHUFFLE(0, 0, 3, 3));
Version C:
ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                              _mm256_castsi256_si128(ymm),
                              0);
Each of these does precisely what we want, clearing the upper 128 bits of a 256-bit YMM register, so any of them could safely be used. But which is the most optimal? Well, that depends on which compiler you are using...
GCC:
Version A: Not supported at all, because GCC lacks the _mm256_set_m128i intrinsic. (It could be simulated, of course, but that would be done using one of the forms in "B" or "C".)
Version B: Compiled to inefficient code. The idiom is not recognized, and the intrinsics are translated very literally to machine-code instructions: a temporary YMM register is zeroed using VPXOR, and then that is blended with the input YMM register using VPBLENDD.
Version C: Ideal. Although the code looks kind of scary and inefficient, all versions of GCC that support AVX2 code generation recognize this idiom. You get the expected VMOVDQA xmm?, xmm? instruction, which implicitly clears the upper bits.
Prefer Version C!
Clang:
Version A: Compiled to inefficient code. A temporary YMM register is zeroed using VPXOR, and then that is inserted into the temporary YMM register using VINSERTI128 (or the floating-point forms, depending on version and options).
Versions B & C: Also compiled to inefficient code. A temporary YMM register is again zeroed, but here, it is blended with the input YMM register using VPBLENDD.
Nothing ideal!
ICC:
Version A: Ideal. Produces the expected VMOVDQA xmm?, xmm? instruction.
Version B: Compiled to inefficient code. Zeros a temporary YMM register, and then blends zeros with the input YMM register (VPBLENDD).
Version C: Also compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 to insert zeros into the temporary YMM register.
Prefer Version A!
MSVC:
Versions A and C: Compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTF128 (A) or VINSERTI128 (C) to insert zeros into the temporary YMM register.
Version B: Also compiled to inefficient code. Zeros a temporary YMM register, and then blends this with the input YMM register using VPBLENDD.
Nothing ideal!
In conclusion, then, it is possible to get GCC and ICC to emit the ideal VMOVDQA instruction if you use the right code sequence. But I can't see any way to get either Clang or MSVC to safely emit a VMOVDQA instruction. These compilers are missing the optimization opportunity.
So, on Clang and MSVC, we have the choice between XOR+blend and XOR+insert. Which is better? We turn to Agner Fog's instruction tables (spreadsheet version also available):
On AMD's Ryzen architecture (Bulldozer-family is similar for the AVX __m256 equivalents of these, and for AVX2 on Excavator):
Instruction | Ops | Latency | Reciprocal Throughput | Execution Ports
---------------|-----|---------|-----------------------|---------------------
VMOVDQA | 1 | 0 | 0.25 | 0 (renamed)
VPBLENDD | 2 | 1 | 0.67 | 3
VINSERTI128 | 2 | 1 | 0.67 | 3
Agner Fog seems to have missed some AVX2 instructions in the Ryzen section of his tables. See this AIDA64 InstLatX64 result for confirmation that VPBLENDD ymm performs the same as VPBLENDW ymm on Ryzen, rather than being the same as VBLENDPS ymm (1c throughput from 2 uops that can run on 2 ports).
See also an Excavator / Carrizo InstLatX64 run showing that VPBLENDD and VINSERTI128 have equal performance there (2 cycle latency, 1 per cycle throughput). Same for VBLENDPS / VINSERTF128.
On Intel architectures (Haswell, Broadwell, and Skylake):
Instruction | Ops | Latency | Reciprocal Throughput | Execution Ports
---------------|-----|---------|-----------------------|---------------------
VMOVDQA | 1 | 0-1 | 0.33 | 3 (may be renamed)
VPBLENDD | 1 | 1 | 0.33 | 3
VINSERTI128 | 1 | 3 | 1.00 | 1
Obviously, VMOVDQA is optimal on both AMD and Intel, but we already knew that, and it doesn't seem to be an option on either Clang or MSVC until their code generators are improved to recognize one of the above idioms or an additional intrinsic is added for this precise purpose.
Luckily, VPBLENDD is at least as good as VINSERTI128 on both AMD and Intel CPUs. On Intel processors, VPBLENDD is a significant improvement over VINSERTI128. (In fact, it's nearly as good as VMOVDQA in the rare case where the latter cannot be renamed, except for needing an all-zero vector constant.) Prefer the sequence of intrinsics that results in a VPBLENDD instruction if you can't coax your compiler to use VMOVDQA.
If you need a floating-point __m256 or __m256d version of this, the choice is more difficult. On Ryzen, VBLENDPS has 1c throughput, but VINSERTF128 has 0.67c. On all other CPUs (including AMD Bulldozer-family), VBLENDPS is equal or better. It's much better on Intel (same as for integer). If you're optimizing specifically for AMD, you may need to do more tests to see which variant is fastest in your particular sequence of code; otherwise, blend. It's only a tiny bit worse on Ryzen.
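If you do need the floating-point form, here is a hedged sketch along the same lines (the helper name clear_upper_ps is invented for the example; the 0x0F mask keeps the low four elements of ymm and takes zeros for the rest):
#include <immintrin.h>
// Hypothetical helper: clear the upper 128 bits of a __m256.
// _mm256_blend_ps takes each element from the second operand where the
// corresponding mask bit is 1, so 0x0F keeps elements 0-3 of ymm and
// selects zeros for elements 4-7.
static inline __m256 clear_upper_ps(__m256 ymm) {
    return _mm256_blend_ps(_mm256_setzero_ps(), ymm, 0x0F);
}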
In summary, then, targeting generic x86 and supporting as many different compilers as possible, we can do:
#if (defined _MSC_VER)
    ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                             ymm,
                             _MM_SHUFFLE(0, 0, 3, 3));
#elif (defined __INTEL_COMPILER)
    ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
#elif (defined __GNUC__)
    // Intended to cover GCC and Clang.
    ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                                  _mm256_castsi256_si128(ymm),
                                  0);
#else
#error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif
See this, and versions A, B, and C separately, on the Godbolt compiler explorer.
Perhaps you could build on this to define your own macro-based intrinsic until something better comes down the pike.
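One way to do that, sketched here under the assumption that the per-compiler findings above still hold (the helper name clear_upper128 is made up for the example):
#include <immintrin.h>
static inline __m256i clear_upper128(__m256i ymm) {
#if defined(_MSC_VER)
    // MSVC: blend with a zeroed register (VPXOR + VPBLENDD).
    return _mm256_blend_epi32(_mm256_setzero_si256(), ymm,
                              _MM_SHUFFLE(0, 0, 3, 3));
#elif defined(__INTEL_COMPILER)
    // ICC recognizes this as a plain VMOVDQA xmm, xmm.
    return _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
#else
    // GCC recognizes this idiom as VMOVDQA; Clang falls back to VPXOR + VPBLENDD.
    return _mm256_inserti128_si256(_mm256_setzero_si256(),
                                   _mm256_castsi256_si128(ymm), 0);
#endif
}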
See what your compiler generates for this:
__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_set_m128i(_mm_setzero_si128(), m1);
or alternatively this:
__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_setzero_si256();
m2 = _mm256_inserti128_si256 (m2, m1, 0);
The version of clang I have here seems to generate the same code for either (vxorps + vinsertf128), but YMMV.