
Bitwise xor of two 256-bit integers

Tags: avx, simd, sse

I have an AVX CPU (which doesn't support AVX2), and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this produce the expected result?

My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into the register exactly as they are in memory?

Thanks.

asked Dec 17 '15 by Kan Li




3 Answers

First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have the two numbers in registers (using up 8 registers total), it's only four xor instructions to get the result (plus 4 mov instructions if you need to keep both sources intact). The destructive version can run at one result per 1.33 clock cycles on SnB, or one per clock on Haswell and later (on Haswell, xor can run on any of the 4 integer ALU ports; SnB has only 3). So if you're just doing a single XOR in between some add/adc or whatever, stick with integers.
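
For reference, here is a plain-C sketch of that integer-register approach (the uint256 struct and limb layout are just an illustration, not something from the question); the loop compiles to four 64-bit xor instructions:

#include <stdint.h>

/* 256-bit integer as four 64-bit limbs, least-significant limb first
   (this layout is only for illustration). */
typedef struct { uint64_t limb[4]; } uint256;

/* Compiles to four 64-bit xor instructions (plus moves if both
   sources must be preserved). */
static inline uint256 xor256(uint256 a, uint256 b) {
    uint256 r;
    for (int i = 0; i < 4; i++)
        r.limb[i] = a.limb[i] ^ b.limb[i];
    return r;
}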

Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).


FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:

VADDPS

SIMD Floating-Point Exceptions
Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS

SIMD Floating-Point Exceptions
None.

(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
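
To illustrate that guarantee, here is a minimal sketch (the signalling-NaN bit patterns and the MXCSR flag mask are my own choices, not from the manual text above) that XORs two vectors of all-sNaN bit patterns and checks that no FP exception flags get set:

#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    /* Bit patterns that are signalling NaNs when viewed as floats. */
    uint32_t a[8], b[8], c[8];
    for (int i = 0; i < 8; i++) { a[i] = 0x7F800001u; b[i] = 0xFF800001u; }

    unsigned flags_before = _mm_getcsr() & 0x3F;   /* MXCSR exception flag bits */

    __m256 va = _mm256_loadu_ps((const float*)a);  /* raw bits, no conversion */
    __m256 vb = _mm256_loadu_ps((const float*)b);
    _mm256_storeu_ps((float*)c, _mm256_xor_ps(va, vb));

    unsigned flags_after = _mm_getcsr() & 0x3F;
    printf("c[0]=%08x  flags before=%x after=%x\n",
           (unsigned)c[0], flags_before, flags_after);   /* flags stay unchanged */
}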

On Intel hardware, either flavour of load/store can feed the FP or integer domain without extra delay. AMD likewise behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.

Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.

No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction has a 66h prefix byte tacked on, or F2 instead of F3 for scalar instructions.)

Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think movaps between two addpd instructions, or even xorps would cause that problem, though: only actual FP math. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)


Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):

256b operations with AVX xorps

VMOVDQU   ymm0, [A]
VXORPS    ymm0, ymm0, [B]
VMOVDQU   [result], ymm0
  • 3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)

  • load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.

    So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.

    Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.

Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port5 (the only port xorps can run on) is needed once every two clocks.
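
In intrinsics, that 3-instruction version is just a load, an XOR, and a store. A hypothetical helper as a sketch (the compiler may emit vmovups instead of vmovdqu, which behaves identically here; this is essentially what the intrinsics answer further down does):

#include <immintrin.h>

/* 256b load / xor / store, equivalent to the asm above (hypothetical helper). */
static inline void xor256_avx(const float *A, const float *B, float *result) {
    __m256 va = _mm256_loadu_ps(A);
    __m256 vb = _mm256_loadu_ps(B);
    _mm256_storeu_ps(result, _mm256_xor_ps(va, vb));
}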


128b ops

VMOVDQU   xmm0, [A]
VMOVDQU   xmm1, [A+16]
VPXOR     xmm0, xmm0, [B]
VPXOR     xmm1, xmm1, [B+16]
VMOVDQU   [result],    xmm0
VMOVDQU   [result+16], xmm1

This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.


In registers

Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to put an XOR result into the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)


Z Boson's hybrid idea:

VMOVDQU   ymm0, [A]
VMOVDQU   ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR     xmm0, xmm0, xmm4
VPXOR     xmm1, xmm1, xmm5
VMOVDQU   [res],    xmm0
VMOVDQU   [res+16], xmm1

Even more fused-domain uops (8) than just doing 128b-everything.

Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.

ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).

So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.

answered Nov 03 '22 by Peter Cordes


There is no problem using _mm256_load_ps to load integers. In fact, in this case it's better than using _mm256_load_si256 (which also works with plain AVX), because you stay in the floating-point domain with _mm256_load_ps.

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256 a8 = _mm256_loadu_ps((float*)a);    // loads the raw bits unchanged, no conversion
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8,b8);          // bitwise XOR of all 256 bits
    int c[8]; _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}

If you want to stay in the integer domain you could do

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);
    __m128i a8lo = _mm256_castsi256_si128(a8);       // low 128 bits (no instruction)
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);  // high 128 bits
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);        // XOR each half in the integer domain
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);
    int c[8];
    _mm_storeu_si128((__m128i*)&c[0],c8lo);
    _mm_storeu_si128((__m128i*)&c[4],c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}

The _mm256_castsi256_si128 intrinsic is free: it generates no instruction.

answered Nov 03 '22 by Z boson


You will probably find that there is little or no difference in performance compared to using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SB/IB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
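
For comparison, a minimal sketch of that 2 x _mm_xor_si128 approach (the function name and the use of unaligned loads are my own choices); it needs only SSE2, no AVX at all:

#include <emmintrin.h>   /* SSE2 */

/* XOR two 256-bit values as two independent 128-bit halves. */
static inline void xor256_sse2(const void *a, const void *b, void *out) {
    __m128i alo = _mm_loadu_si128((const __m128i *)a);
    __m128i ahi = _mm_loadu_si128((const __m128i *)a + 1);
    __m128i blo = _mm_loadu_si128((const __m128i *)b);
    __m128i bhi = _mm_loadu_si128((const __m128i *)b + 1);
    _mm_storeu_si128((__m128i *)out,     _mm_xor_si128(alo, blo));
    _mm_storeu_si128((__m128i *)out + 1, _mm_xor_si128(ahi, bhi));
}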

answered Nov 03 '22 by Paul R