I have an AVX CPU (no AVX2 support), and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?

My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into registers exactly as they are in memory?

Thanks.
First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have the two numbers in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
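As a minimal sketch of that pure-integer route (the u256 struct and function name here are placeholders for illustration, not from the question), the whole operation is just four independent 64-bit XORs:

#include <stdint.h>

/* 256-bit value as four 64-bit limbs (illustrative layout). */
typedef struct { uint64_t limb[4]; } u256;

static inline u256 xor256_scalar(u256 a, u256 b) {
    u256 r;
    r.limb[0] = a.limb[0] ^ b.limb[0];  /* four independent xor instructions: */
    r.limb[1] = a.limb[1] ^ b.limb[1];  /* no carries, so they can execute    */
    r.limb[2] = a.limb[2] ^ b.limb[2];  /* in parallel on any ALU port        */
    r.limb[3] = a.limb[3] ^ b.limb[3];
    return r;
}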
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: a 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions: Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS
SIMD Floating-Point Exceptions: None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
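As a quick sanity check (a minimal sketch, not part of the original answer), you can push a signalling-NaN bit pattern through _mm256_loadu_ps / _mm256_xor_ps / _mm256_storeu_ps and check that the bits round-trip exactly:

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    /* 0x7f800001 is a signalling-NaN bit pattern for a 32-bit float. */
    unsigned a[8] = {0x7f800001u, 0, 0, 0, 0, 0, 0, 0};
    unsigned b[8] = {0};                     /* XOR with zero leaves the bits unchanged */
    unsigned c[8];
    __m256 va = _mm256_loadu_ps((float*)a);
    __m256 vb = _mm256_loadu_ps((float*)b);
    _mm256_storeu_ps((float*)c, _mm256_xor_ps(va, vb));
    printf("%x\n", c[0]);                    /* prints 7f800001 */
}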
On Intel hardware, either flavour of load/store can go to the FP or integer domain without extra delay. AMD likewise behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on, or F2 instead of F3 for scalar instructions.)

Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think a movaps between two addpd instructions, or even a xorps, would cause that problem, though: only actual FP math does. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b vxorps version:
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port 5 (the only port xorps can run on) is needed once every two clocks.
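In intrinsics, that 3-instruction 256b version looks roughly like the sketch below (the function and buffer names are placeholders, not from the original answer); the asm block that follows does the same job with only 128b operations:

#include <x86intrin.h>

/* 256b load / vxorps / 256b store. A, B and result point to 32-byte buffers;
   the *_loadu_* / *_storeu_* forms don't require alignment. */
static inline void xor256_avx(const void *A, const void *B, void *result) {
    __m256 a = _mm256_loadu_ps((const float*)A);
    __m256 b = _mm256_loadu_ps((const float*)B);
    _mm256_storeu_ps((float*)result, _mm256_xor_ps(a, b));
}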
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
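As a sketch under the same assumptions (placeholder names, unaligned buffers), this all-128b variant is two independent _mm_xor_si128 operations:

#include <x86intrin.h>

/* Two 128b loads per input, two vpxor, two 128b stores. */
static inline void xor256_sse(const void *A, const void *B, void *result) {
    __m128i a0 = _mm_loadu_si128((const __m128i*)A);
    __m128i a1 = _mm_loadu_si128((const __m128i*)A + 1);
    __m128i b0 = _mm_loadu_si128((const __m128i*)B);
    __m128i b1 = _mm_loadu_si128((const __m128i*)B + 1);
    _mm_storeu_si128((__m128i*)result,     _mm_xor_si128(a0, b0));
    _mm_storeu_si128((__m128i*)result + 1, _mm_xor_si128(a1, b1));
}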
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to put an XOR result into the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
There is no problem using _mm256_load_ps to load integers. In fact, in this case it's better than using _mm256_load_si256 (which does work with AVX), because you stay in the floating-point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256 a8 = _mm256_loadu_ps((float*)a);   /* load the integer bits as packed floats */
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);        /* bitwise XOR in the FP domain */

    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);

    /* split each 256b value into 128b halves and XOR them with SSE pxor */
    __m128i a8lo = _mm256_castsi256_si128(a8);
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);

    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
The _mm256_castsi256_si128 intrinsics are free.
You will probably find that there is little or no difference in performance compared to using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SnB/IvB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
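For completeness: both examples need AVX enabled at compile time, e.g. gcc -O2 -mavx (or -march=native on an AVX machine) with GCC/Clang, or /arch:AVX with MSVC; the exact flags depend on your toolchain.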