How to clear the upper 128 bits of __m256 value?

Tags: c, x86, avx, simd, avx2

How can I clear the upper 128 bits of m2:

__m256i    m2 = _mm256_set1_epi32(2);
__m128i    m1 = _mm_set1_epi32(1);

m2 = _mm256_castsi128_si256(_mm256_castsi256_si128(m2));
m2 = _mm256_castsi128_si256(m1);

These don't work; Intel's documentation for the _mm256_castsi128_si256 intrinsic says that "the upper bits of the resulting vector are undefined". At the same time, I can easily do it in assembly:

VMOVDQA xmm2, xmm2  //zeros upper ymm2
VMOVDQA xmm2, xmm1

Of course I'd rather not use "and" or _mm256_insertf128_si256() and the like.

Asked by seda, Jan 27 '14.

3 Answers

A new intrinsic function has been added for solving this problem:

m2 = _mm256_zextsi128_si256(m1);

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_zextsi128_si256&expand=6177,6177

This function doesn't produce any code if the upper half is already known to be zero; it just makes sure the upper half is not treated as undefined.
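For completeness, a minimal usage sketch, assuming a compiler recent enough to provide this intrinsic (the wrapper function is just for illustration):

#include <immintrin.h>

// Zero-extend a 128-bit vector into a 256-bit one: the low 128 bits hold m1,
// and the upper 128 bits are guaranteed to be zero rather than undefined.
__m256i widen_with_zero_upper(__m128i m1)
{
    return _mm256_zextsi128_si256(m1);
}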

Answered by Agner Fog.

Update: there's now a __m256i _mm256_zextsi128_si256(__m128i) intrinsic; see Agner Fog's answer. The rest of the answer below is only relevant for old compilers that don't support this intrinsic, and where there's no efficient, portable solution.


Unfortunately, the ideal solution will depend on which compiler you are using, and on some of them, there is no ideal solution.

There are several basic ways that we could write this:

Version A:

ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

Version B:

ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                         ymm,
                         _MM_SHUFFLE(0, 0, 3, 3));

Version C:

ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                              _mm256_castsi256_si128(ymm),
                              0);

Each of these does precisely what we want, clearing the upper 128 bits of a 256-bit YMM register, so any of them could safely be used. But which is optimal? Well, that depends on which compiler you are using...

GCC:

Version A: Not supported at all because GCC lacks the _mm256_set_m128i intrinsic. (It could be simulated, of course; a sketch is shown after these GCC notes. But any simulation would itself boil down to one of the forms in "B" or "C".)

Version B: Compiled to inefficient code. Idiom is not recognized and intrinsics are translated very literally to machine-code instructions. A temporary YMM register is zeroed using VPXOR, and then that is blended with the input YMM register using VPBLENDD.

Version C: Ideal. Although the code looks kind of scary and inefficient, all versions of GCC that support AVX2 code generation recognize this idiom. You get the expected VMOVDQA xmm?, xmm? instruction, which implicitly clears the upper bits.

Prefer Version C!
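For reference, a GCC version that lacks _mm256_set_m128i could define an equivalent itself. A minimal sketch, assuming AVX2 is available (the helper name is made up):

#include <immintrin.h>

// Hypothetical stand-in for _mm256_set_m128i: cast the low half up to 256 bits
// (upper bits undefined at this point) and then overwrite the upper lane with hi.
static inline __m256i my_mm256_set_m128i(__m128i hi, __m128i lo)
{
    return _mm256_inserti128_si256(_mm256_castsi128_si256(lo), hi, 1);
}

As noted above, this just reduces to the insert form, so on GCC you might as well write Version C directly.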

Clang:

Version A: Compiled to inefficient code. A temporary YMM register is zeroed using VPXOR, and then the lower 128 bits of the input are inserted into it using VINSERTI128 (or the floating-point form, depending on version and options).

Versions B & C: Also compiled to inefficient code. A temporary YMM register is again zeroed, but here it is blended with the input YMM register using VPBLENDD.

Nothing ideal!

ICC:

Version A: Ideal. Produces the expected VMOVDQA xmm?, xmm? instruction.

Version B: Compiled to inefficient code. Zeros a temporary YMM register, and then blends zeros with the input YMM register (VPBLENDD).

Version C: Also compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 to insert the lower 128 bits of the input into it.

Prefer Version A!

MSVC:

Versions A and C: Compiled to inefficient code. Zeros a temporary YMM register, and then uses VINSERTI128 (A) or VINSERTF128 (C) to insert the lower 128 bits of the input into it.

Version B: Also compiled to inefficient code. Zeros a temporary YMM register, and then blends this with the input YMM register using VPBLENDD.

Nothing ideal!


In conclusion, then, it is possible to get GCC and ICC to emit the ideal VMOVDQA instruction, if you use the right code sequence. But, I can't see any way to get either Clang or MSVC to safely emit a VMOVDQA instruction. These compilers are missing the optimization opportunity.

So, on Clang and MSVC, we have the choice between XOR+blend and XOR+insert. Which is better? We turn to Agner Fog's instruction tables (spreadsheet version also available):

On AMD's Ryzen architecture (Bulldozer-family is similar for the AVX __m256 equivalents of these instructions, and for AVX2 on Excavator):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |    0    |          0.25         |   0 (renamed)
   VPBLENDD     |  2  |    1    |          0.67         |   3
   VINSERTI128  |  2  |    1    |          0.67         |   3

Agner Fog seems to have missed some AVX2 instructions in the Ryzen section of his tables. See this AIDA64 InstLatX64 result for confirmation that VPBLENDD ymm performs the same as VPBLENDW ymm on Ryzen, rather than being the same as VBLENDPS ymm (1c throughput from 2 uops that can run on 2 ports).

See also an Excavator / Carrizo InstLatX64 showing that VPBLENDD and VINSERTI128 have equal performance there (2 cycle latency, 1 per cycle throughput). Same for VBLENDPS/VINSERTF128.

On Intel architectures (Haswell, Broadwell, and Skylake):

  Instruction   | Ops | Latency | Reciprocal Throughput |   Execution Ports
 ---------------|-----|---------|-----------------------|---------------------
   VMOVDQA      |  1  |   0-1   |          0.33         |   3 (may be renamed)
   VPBLENDD     |  1  |    1    |          0.33         |   3
   VINSERTI128  |  1  |    3    |          1.00         |   1

Obviously, VMOVDQA is optimal on both AMD and Intel, but we already knew that, and it doesn't seem to be an option on either Clang or MSVC until their code generators are improved to recognize one of the above idioms or an additional intrinsic is added for this precise purpose.

Luckily, VPBLENDD is at least as good as VINSERTI128 on both AMD and Intel CPUs. On Intel processors, VPBLENDD is a significant improvement over VINSERTI128. (In fact, it's nearly as good as VMOVDQA in the rare case where the latter cannot be renamed, except for needing an all-zero vector constant.) Prefer the sequence of intrinsics that results in a VPBLENDD instruction if you can't coax your compiler to use VMOVDQA.

If you need a floating-point __m256 or __m256d version of this, the choice is more difficult. On Ryzen, VBLENDPS has 1c throughput, but VINSERTF128 has 0.67c. On all other CPUs (including AMD Bulldozer-family), VBLENDPS is equal or better, and it's much better on Intel (same as for the integer case). If you're optimizing specifically for AMD, you may need to run more tests to see which variant is fastest in your particular sequence of code; otherwise, use the blend. It's only a tiny bit worse on Ryzen.
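For reference, the packed-float analogues of Versions A, B, and C might look like this (a sketch only; the wrapper names are mine, and _mm256_set_m128 has the same availability caveats as _mm256_set_m128i):

#include <immintrin.h>

// Version A: rebuild from the low half plus a zeroed high half.
static inline __m256 clear_upper_ps_a(__m256 ymm)
{
    return _mm256_set_m128(_mm_setzero_ps(), _mm256_castps256_ps128(ymm));
}

// Version B: keep the low four floats, take zeros for the high four (VBLENDPS).
static inline __m256 clear_upper_ps_b(__m256 ymm)
{
    return _mm256_blend_ps(_mm256_setzero_ps(), ymm, _MM_SHUFFLE(0, 0, 3, 3));
}

// Version C: insert the low half into lane 0 of a zeroed vector (VINSERTF128).
static inline __m256 clear_upper_ps_c(__m256 ymm)
{
    return _mm256_insertf128_ps(_mm256_setzero_ps(),
                                _mm256_castps256_ps128(ymm), 0);
}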

In summary, then, targeting generic x86 and supporting as many different compilers as possible, we can do:

#if (defined _MSC_VER)

    ymm = _mm256_blend_epi32(_mm256_setzero_si256(),
                             ymm,
                             _MM_SHUFFLE(0, 0, 3, 3));

#elif (defined __INTEL_COMPILER)

    ymm = _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));

#elif (defined __GNUC__)

    // Intended to cover GCC and Clang.
    ymm = _mm256_inserti128_si256(_mm256_setzero_si256(),
                                  _mm256_castsi256_si128(ymm),
                                  0);

#else
    #error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif

See this, and versions A, B, and C separately, on the Godbolt compiler explorer.

Perhaps you could build on this to define your own macro-based intrinsic until something better comes down the pike.
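For instance, a thin wrapper along these lines (the function name is made up for illustration) keeps the compiler-specific sequence in one place:

#include <immintrin.h>

// Hypothetical helper: clear the upper 128 bits of a __m256i, using the
// sequence found above to compile best on each compiler family.
static inline __m256i clear_upper128_si256(__m256i ymm)
{
#if (defined _MSC_VER)
    return _mm256_blend_epi32(_mm256_setzero_si256(), ymm,
                              _MM_SHUFFLE(0, 0, 3, 3));
#elif (defined __INTEL_COMPILER)
    return _mm256_set_m128i(_mm_setzero_si128(), _mm256_castsi256_si128(ymm));
#elif (defined __GNUC__)
    // Intended to cover GCC and Clang.
    return _mm256_inserti128_si256(_mm256_setzero_si256(),
                                   _mm256_castsi256_si128(ymm), 0);
#else
    #error "Unsupported compiler: need to figure out optimal sequence for this compiler."
#endif
}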

Answered by Cody Gray.

See what your compiler generates for this:

__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_set_m128i(_mm_setzero_si128(), m1);

or alternatively this:

__m128i m1 = _mm_set1_epi32(1);
__m256i m2 = _mm256_setzero_si256();
m2 = _mm256_inserti128_si256 (m2, m1, 0);

The version of clang I have here seems to generate the same code for either (vxorps + vinsertf128), but YMMV.

Answered by Paul R.