
Horizontal XOR in AVX

Is there a way to horizontally XOR an AVX register, specifically to XOR together the four 64-bit components of a 256-bit register?

The goal is to get the XOR of all 4 64-bit components of an AVX register. It would essentially be doing the same thing as a horizontal add (_mm256_hadd_epi32()), except that I want to XOR instead of ADD.

The scalar code is:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}
asked Jul 05 '17 by Serge Rogatch

3 Answers

As stated in the comments, the fastest code very likely uses scalar operations, doing everything in the integer registers. All you need to do is extract the four packed 64-bit integers, then you have three XOR instructions, and you're done. This can be done pretty efficiently, and it leaves the result in an integer register, which is what your sample code suggests that you would want.

MSVC already generates pretty good code for the scalar function that you show as an example in the question:

inline uint64_t HorizontalXor(__m256i t) {
  return t.m256i_u64[0] ^ t.m256i_u64[1] ^ t.m256i_u64[2] ^ t.m256i_u64[3];
}

Assuming that t is in ymm1, the resulting disassembly will be something like this:

vextractf128 xmm0, ymm1, 1
vpextrq      rax,  xmm0, 1
vmovq        rcx,  xmm1
xor          rax,  rcx
vpextrq      rcx,  xmm1, 1
vextractf128 xmm0, ymm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm0
xor          rax,  rcx

…with the result left in RAX. If this accurately reflects what you need (a scalar uint64_t result), then this code would be sufficient.
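
Note that the .m256i_u64 member access in the original function is MSVC-specific. On GCC or Clang, a portable sketch of the same scalar reduction (the helper name here is hypothetical) could simply spill the vector to memory and XOR the lanes in ordinary C++:

#include <immintrin.h>
#include <stdint.h>

inline uint64_t HorizontalXorPortable(__m256i t) {
  uint64_t lanes[4];
  _mm256_storeu_si256((__m256i *)lanes, t);   // spill all four 64-bit lanes
  return lanes[0] ^ lanes[1] ^ lanes[2] ^ lanes[3];
}

Whether the compiler turns the store/reload back into register extracts varies, but this form compiles anywhere the AVX intrinsics are available.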

You can slightly improve it by using intrinsics:

inline uint64_t _mm256_hxor_epu64(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return (uint64_t&)x
          ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
          ^ (uint64_t&)(temp)
          ^ (uint64_t)(_mm_extract_epi64(temp, 1));
}

Then you'll get the following disassembly (again, assuming that x is in ymm1):

vextracti128 xmm2, ymm1, 1
vpextrq      rcx,  xmm2, 1
vpextrq      rax,  xmm1, 1
xor          rax,  rcx
vmovq        rcx,  xmm1
xor          rax,  rcx
vmovq        rcx,  xmm2
xor          rax,  rcx

Notice that we were able to elide one extraction instruction, and that we've ensured VEXTRACTI128 was used instead of VEXTRACTF128 (although this choice probably does not matter).

You'll see similar output on other compilers. For example, here's GCC 7.1 (with x assumed to be in ymm0):

vextracti128 xmm2, ymm0, 0x1
vpextrq      rax,  xmm0, 1
vmovq        rdx,  xmm2
vpextrq      rcx,  xmm2, 1
xor          rax,  rdx
vmovq        rdx,  xmm0
xor          rax,  rdx
xor          rax,  rcx

The same instructions are there, but they've been slightly reordered. The intrinsics allow the compiler's scheduler to order them as it deems best. Clang 4.0 schedules them yet differently:

vmovq        rax,  xmm0
vpextrq      rcx,  xmm0, 1
xor          rcx,  rax
vextracti128 xmm0, ymm0, 1
vmovq        rdx,  xmm0
xor          rdx,  rcx
vpextrq      rax,  xmm0, 1
xor          rax,  rdx

And, of course, this ordering is always subject to change when the code gets inlined.


On the other hand, if you want the result to be in an AVX register, then you first need to decide how you want it to be stored. I'm guessing you would just broadcast the single 64-bit result to all four lanes, something like:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m128i temp = _mm256_extracti128_si256(x, 1);
   return _mm256_set1_epi64x((uint64_t&)x
                             ^ (uint64_t)(_mm_extract_epi64(_mm256_castsi256_si128(x), 1))
                             ^ (uint64_t&)(temp)
                             ^ (uint64_t)(_mm_extract_epi64(temp, 1)));
}

But now you're doing a lot of data shuffling, negating any performance boost that you would possibly see from vectorizing the code.

Speaking of which, I'm not really sure how you got yourself into a situation where you need to do horizontal operations like this in the first place. SIMD operations are designed to scale vertically, not horizontally. If you're still in the implementation phase, it may be appropriate to reconsider the design. In particular, you should be generating the 4 integer values in 4 different AVX registers, rather than packing them all into one.
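
To make that concrete, here is a minimal sketch, assuming the four 64-bit values are produced as (lanes of) four separate AVX registers a, b, c, and d (hypothetical names). The XOR then stays a plain vertical operation: each lane of the result is an independent XOR, and no shuffles or extracts are needed at all.

#include <immintrin.h>

inline __m256i VerticalXor(__m256i a, __m256i b, __m256i c, __m256i d) {
  // Element-wise XOR of four vectors; no horizontal reduction required.
  return _mm256_xor_si256(_mm256_xor_si256(a, b),
                          _mm256_xor_si256(c, d));
}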

If you actually want 4 copies of the result packed into an AVX register, then you could do something like this:

inline __m256i _mm256_hxor(__m256i x)
{
   const __m256i temp = _mm256_xor_si256(x,
                                         _mm256_permute2f128_si256(x, x, 1));    
   return _mm256_xor_si256(temp,
                           _mm256_shuffle_epi32(temp, _MM_SHUFFLE(1, 0, 3, 2)));
}

This still exploits a bit of parallelism by doing two XORs at once, meaning that only two XOR operations are required in all, instead of three.

If it helps to visualize it, this basically does:

   A         B         C         D           ⟵ input
  XOR       XOR       XOR       XOR
   C         D         A         B           ⟵ permuted input
=====================================
  A^C       B^D       A^C        B^D         ⟵ intermediate result
  XOR       XOR       XOR        XOR
  B^D       A^C       B^D        A^C         ⟵ shuffled intermediate result
======================================
A^C^B^D   A^C^B^D   A^C^B^D    A^C^B^D      ⟵ final result

On practically all compilers, these intrinsics will produce the following assembly code:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2

(I had come up with this on my way to bed after first posting this answer, and planned to come back and update the answer, but I see that wim beat me to the punch on posting it. Oh well, it's still a better approach than what I first had, so it still merits being included here.)

And, of course, if you wanted this in an integer register, you would just need a simple VMOVQ:

vperm2f128  ymm0, ymm1, ymm1, 1    ; input is in YMM1
vpxor       ymm2, ymm0, ymm1
vpshufd     ymm1, ymm2, 78
vpxor       ymm0, ymm1, ymm2
vmovq       rax,  xmm0

The question is, would this be faster than the scalar code above? The answer is: yes, probably. Although you are doing the XORs using the AVX execution units, instead of the completely separate integer execution units, there are fewer AVX shuffles/permutes/extracts that need to be done, which means less overhead. So I might also have to eat my words on scalar code being the fastest implementation. But it really depends on what you're doing and how the instructions can be scheduled/interleaved.

answered Nov 18 '22 by Cody Gray

Vectorization is likely to be useful if the input of the horizontal XOR function is already in an AVX register, i.e. your t is the result of some SIMD computation. Otherwise, scalar code is likely to be faster, as already mentioned by @Cody Gray. Often you can do horizontal SIMD operations in about log_2(SIMD_width) 'steps'. In this case one step is a 'shuffle/permute' and an 'xor'. This is slightly more efficient than @Cody Gray's _mm256_hxor function:

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);       // swap the 128 bit high and low lane 
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);    // swap the two 64-bit elements within each 128-bit lane
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return x3;
}

This compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0


If you want the result in a scalar register:

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}

Which compiles to:

vperm2i128  $1, %ymm0, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vpshufd $78, %ymm0, %ymm1
vpxor   %ymm1, %ymm0, %ymm0
vmovq   %xmm0, %rax


Full test code:

#include <stdio.h>
#include <x86intrin.h>
#include <stdint.h>
/*  gcc -O3 -Wall -m64 -march=broadwell hor_xor.c   */
int print_vec_uint64(__m256i v);

__m256i _mm256_hxor_v2(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
/* Uncomment the next few lines to print the values of the intermediate variables */ 
/*
    printf("3...0        =          3          2          1          0\n");
    printf("x            = ");print_vec_uint64(x        );
    printf("x0           = ");print_vec_uint64(x0        );
    printf("x1           = ");print_vec_uint64(x1        );
    printf("x2           = ");print_vec_uint64(x2        );
    printf("x3           = ");print_vec_uint64(x3        );
*/
    return x3;
}

uint64_t _mm256_hxor_v2_uint64(__m256i x)
{
    __m256i x0 = _mm256_permute2x128_si256(x,x,1);
    __m256i x1 = _mm256_xor_si256(x,x0);
    __m256i x2 = _mm256_shuffle_epi32(x1,0b01001110);
    __m256i x3 = _mm256_xor_si256(x1,x2);
    return _mm_cvtsi128_si64x(_mm256_castsi256_si128(x3)) ;
}


int main() {
    __m256i x = _mm256_set_epi64x(0x7, 0x5, 0x2, 0xB);
//    __m256i x = _mm256_set_epi64x(4235566778345231, 1123312566778345423, 72345566778345673, 967856775433457);

    printf("x            = ");print_vec_uint64(x);

    __m256i y = _mm256_hxor_v2(x);

    printf("y            = ");print_vec_uint64(y);

    uint64_t z = _mm256_hxor_v2_uint64(x);

    printf("z =  %10lX  \n",z);

    return 0;
}


int print_vec_uint64(__m256i v){
    uint64_t t[4];
    _mm256_storeu_si256((__m256i *)t,v);
    printf("%10lX %10lX %10lX %10lX \n",t[3],t[2],t[1],t[0]);
    return 0;
}
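
For the small example input, the expected value is 0x7 ^ 0x5 ^ 0x2 ^ 0xB = 0xB, so (modulo exact field widths) the program should print something like:

x            =          7          5          2          B
y            =          B          B          B          B
z =           B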
answered Nov 18 '22 by wim


An implementation of a direct analogue of _mm256_hadd_epi32() for XOR would look something like this:

#include <immintrin.h>

template<int imm> inline __m256i _mm256_shuffle_epi32(__m256i a, __m256i b)
{
    return _mm256_castps_si256(_mm256_shuffle_ps(_mm256_castsi256_ps(a), _mm256_castsi256_ps(b), imm));
}

inline __m256i _mm256_hxor_epi32(__m256i a, __m256i b)
{
    return _mm256_xor_si256(_mm256_shuffle_epi32<0x88>(a, b), _mm256_shuffle_epi32<0xDD>(a, b));
}

int main()
{
    __m256i a = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i b = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 7, 8);
    __m256i c = _mm256_hxor_epi32(a, b);
    return 0;
}
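
Note that, just like _mm256_hadd_epi32(), this is a pairwise, per-lane operation rather than a full reduction: for the inputs above, c should hold {0^1, 2^3, 1^2, 3^4, 4^5, 6^7, 5^6, 7^8} = {1, 1, 3, 7, 1, 1, 3, 15} in _mm256_setr_epi32 element order.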
answered Nov 18 '22 by ErmIg