I have an AVX CPU (no AVX2 support), and I want to compute the bitwise XOR of two 256-bit integers.

Since _mm256_xor_si256 is only available with AVX2, can I load these 256 bits as __m256 using _mm256_load_ps and then do a _mm256_xor_ps? Will this generate the expected result?

My major concern is: if the memory content is not a valid floating-point number, will _mm256_load_ps fail to load the bits into registers exactly as they are in memory?

Thanks.
First of all, if you're doing other things with your 256b integers (like adding/subtracting/multiplying), getting them into vector registers just for the occasional XOR may not be worth the overhead of transferring them. If you already have the two numbers in registers (using up 8 total registers), it's only four xor instructions to get the result (and 4 mov instructions if you need to avoid overwriting the destination). The destructive version can run at one per 1.33 clock cycles on SnB, or one per clock on Haswell and later (xor can run on any of the 4 ALU ports). So if you're just doing a single xor in between some add/adc or whatever, stick with integers.
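As a minimal sketch of that pure-integer route (the u256 struct and function name here are placeholders for illustration, not from the question), the whole operation is just four independent 64-bit XORs:

#include <stdint.h>

/* 256-bit value as four 64-bit limbs (illustrative layout). */
typedef struct { uint64_t limb[4]; } u256;

static inline u256 xor256_scalar(u256 a, u256 b) {
    u256 r;
    r.limb[0] = a.limb[0] ^ b.limb[0];  /* four independent xor instructions: */
    r.limb[1] = a.limb[1] ^ b.limb[1];  /* no carries, so they can execute    */
    r.limb[2] = a.limb[2] ^ b.limb[2];  /* in parallel on any ALU port        */
    r.limb[3] = a.limb[3] ^ b.limb[3];
    return r;
}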
Storing to memory in 64b chunks and then doing a 128b or 256b load would cause a store-forwarding failure, adding another several cycles of latency. Using movq / pinsrq would cost more execution resources than xor. Going the other way isn't as bad: a 256b store -> 64b loads is fine for store forwarding. movq / pextrq still suck, but would have lower latency (at the cost of more uops).
FP load/store/bitwise operations are architecturally guaranteed not to generate FP exceptions, even when used on bit patterns that represent a signalling NaN. Only actual FP math instructions list math exceptions:
VADDPS
SIMD Floating-Point Exceptions: Overflow, Underflow, Invalid, Precision, Denormal.

VMOVAPS
SIMD Floating-Point Exceptions: None.
(From Intel's insn ref manual. See the x86 wiki for links to that and other stuff.)
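As a quick sanity check (a minimal sketch, not part of the original answer), you can push a signalling-NaN bit pattern through _mm256_loadu_ps / _mm256_xor_ps / _mm256_storeu_ps and check that the bits round-trip exactly:

#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    /* 0x7f800001 is a signalling-NaN bit pattern for a 32-bit float. */
    unsigned a[8] = {0x7f800001u, 0, 0, 0, 0, 0, 0, 0};
    unsigned b[8] = {0};                     /* XOR with zero leaves the bits unchanged */
    unsigned c[8];
    __m256 va = _mm256_loadu_ps((float*)a);
    __m256 vb = _mm256_loadu_ps((float*)b);
    _mm256_storeu_ps((float*)c, _mm256_xor_ps(va, vb));
    printf("%x\n", c[0]);                    /* prints 7f800001 */
}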
On Intel hardware, either flavour of load/store can go to the FP or integer domain without extra delay. AMD likewise behaves the same whichever flavour of load/store is used, regardless of where the data is going to / coming from.
Different flavours of vector move instruction actually matter for register<-register moves. On Intel Nehalem, using the wrong mov instruction can cause a bypass delay. On AMD Bulldozer-family, where moves are handled by register renaming rather than actually copying the data (like Intel IvB and later), the dest register inherits the domain of whatever wrote the src register.
No existing design I've read about has handled movapd any differently from movaps. Presumably Intel created movapd as much for decode simplicity as for future planning (e.g. to allow for the possibility of a design where there's a double domain and a single domain, with different forwarding networks). (movapd is movaps with a 66h prefix, just like the double version of every other SSE instruction just has the 66h prefix byte tacked on, or F2 instead of F3 for scalar instructions.)

Apparently AMD designs tag FP vectors with auxiliary info, because Agner Fog found a large delay when using the output of addps as the input for addpd, for example. I don't think a movaps between two addpd instructions, or even a xorps, would cause that problem, though: only actual FP math does. (FP bitwise boolean ops are integer-domain on Bulldozer-family.)
Theoretical throughput on Intel SnB/IvB (the only Intel CPUs with AVX but not AVX2):
256b vxorps version:
VMOVDQU ymm0, [A]
VXORPS ymm0, ymm0, [B]
VMOVDQU [result], ymm0
3 fused-domain uops can issue at one per 0.75 cycles since the pipeline width is 4 fused-domain uops. (Assuming the addressing modes you use for B and result can micro-fuse, otherwise it's 5 fused-domain uops.)
load port: 256b loads / stores on SnB take 2 cycles (split into 128b halves), but this frees up the AGU on port 2/3 to be used by the store. There's a dedicated store-data port, but store-address calculation needs the AGU from a load port.
So with only 128b or smaller loads/stores, SnB/IvB can sustain two memory ops per cycle (with at most one of them being a store). With 256b ops, SnB/IvB can theoretically sustain two 256b loads and one 256b store per two cycles. Cache-bank conflicts usually make this impossible, though.
Haswell has a dedicated store-address port, and can sustain two 256b loads and one 256b store per one cycle, and doesn't have cache bank conflicts. So Haswell is much faster when everything's in L1 cache.
Bottom line: In theory (no cache-bank conflicts) this should saturate SnB's load and store ports, processing 128b per cycle. Port 5 (the only port xorps can run on) is needed once every two clocks.
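In intrinsics, that 3-instruction 256b version looks roughly like the sketch below (the function and buffer names are placeholders, not from the original answer); the asm block that follows does the same job with only 128b operations:

#include <x86intrin.h>

/* 256b load / vxorps / 256b store. A, B and result point to 32-byte buffers;
   the *_loadu_* / *_storeu_* forms don't require alignment. */
static inline void xor256_avx(const void *A, const void *B, void *result) {
    __m256 a = _mm256_loadu_ps((const float*)A);
    __m256 b = _mm256_loadu_ps((const float*)B);
    _mm256_storeu_ps((float*)result, _mm256_xor_ps(a, b));
}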
VMOVDQU xmm0, [A]
VMOVDQU xmm1, [A+16]
VPXOR xmm0, xmm0, [B]
VPXOR xmm1, xmm1, [B+16]
VMOVDQU [result], xmm0
VMOVDQU [result+16], xmm1
This will bottleneck on address generation, since SnB can only sustain two 128b memory ops per cycle. It will also use 2x as much space in the uop cache, and more x86 machine code size. Barring cache-bank conflicts, this should run with a throughput of one 256b-xor per 3 clocks.
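As a sketch under the same assumptions (placeholder names, unaligned buffers), this all-128b variant is two independent _mm_xor_si128 operations:

#include <x86intrin.h>

/* Two 128b loads per input, two vpxor, two 128b stores. */
static inline void xor256_sse(const void *A, const void *B, void *result) {
    __m128i a0 = _mm_loadu_si128((const __m128i*)A);
    __m128i a1 = _mm_loadu_si128((const __m128i*)A + 1);
    __m128i b0 = _mm_loadu_si128((const __m128i*)B);
    __m128i b1 = _mm_loadu_si128((const __m128i*)B + 1);
    _mm_storeu_si128((__m128i*)result,     _mm_xor_si128(a0, b0));
    _mm_storeu_si128((__m128i*)result + 1, _mm_xor_si128(a1, b1));
}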
Between registers, one 256b VXORPS and two 128b VPXOR per clock would saturate SnB. On Haswell, three AVX2 256b VPXOR per clock would give the most XOR-ing per cycle. (XORPS and PXOR do the same thing, but XORPS's output can forward to the FP execution units without an extra cycle of forwarding delay. I guess only one execution unit has the wiring to put an XOR result into the FP domain, so Intel CPUs post-Nehalem only run XORPS on one port.)
VMOVDQU ymm0, [A]
VMOVDQU ymm4, [B]
VEXTRACTF128 xmm1, ymm0, 1
VEXTRACTF128 xmm5, ymm4, 1
VPXOR xmm0, xmm0, xmm4
VPXOR xmm1, xmm1, xmm5
VMOVDQU [res], xmm0
VMOVDQU [res+16], xmm1
Even more fused-domain uops (8) than just doing 128b-everything.
Load/store: two 256b loads leave two spare cycles for two store addresses to be generated, so this can still run at two loads/one store of 128b per cycle.
ALU: two port-5 uops (vextractf128), two port0/1/5 uops (vpxor).
So this still has a throughput of one 256b result per 2 clocks, but it's saturating more resources and has no advantage (on Intel) over the 3-instruction 256b version.
There is no problem using _mm256_load_ps to load integers. In fact, in this case it's better than using _mm256_load_si256 (which does work with AVX), because you stay in the floating-point domain with _mm256_load_ps.
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256 a8 = _mm256_loadu_ps((float*)a);   /* load the integer bits as packed floats */
    __m256 b8 = _mm256_loadu_ps((float*)b);
    __m256 c8 = _mm256_xor_ps(a8, b8);        /* bitwise XOR in the FP domain */

    int c[8];
    _mm256_storeu_ps((float*)c, c8);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
If you want to stay in the integer domain you could do
#include <x86intrin.h>
#include <stdio.h>

int main(void) {
    int a[8] = {1,2,3,4,5,6,7,8};
    int b[8] = {-2,-3,-4,-5,-6,-7,-8,-9};

    __m256i a8 = _mm256_loadu_si256((__m256i*)a);
    __m256i b8 = _mm256_loadu_si256((__m256i*)b);

    /* split each 256b value into 128b halves and XOR them with SSE pxor */
    __m128i a8lo = _mm256_castsi256_si128(a8);
    __m128i a8hi = _mm256_extractf128_si256(a8, 1);
    __m128i b8lo = _mm256_castsi256_si128(b8);
    __m128i b8hi = _mm256_extractf128_si256(b8, 1);
    __m128i c8lo = _mm_xor_si128(a8lo, b8lo);
    __m128i c8hi = _mm_xor_si128(a8hi, b8hi);

    int c[8];
    _mm_storeu_si128((__m128i*)&c[0], c8lo);
    _mm_storeu_si128((__m128i*)&c[4], c8hi);
    printf("%x %x %x %x\n", c[0], c[1], c[2], c[3]);
}
The _mm256_castsi256_si128 intrinsics are free.
You will probably find that there is little or no difference in performance compared to using 2 x _mm_xor_si128. It's even possible that the AVX implementation will be slower, since _mm256_xor_ps has a reciprocal throughput of 1 on SnB/IvB/Haswell, whereas _mm_xor_si128 has a reciprocal throughput of 0.33.
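For completeness: both examples need AVX enabled at compile time, e.g. gcc -O2 -mavx (or -march=native on an AVX machine) with GCC/Clang, or /arch:AVX with MSVC; the exact flags depend on your toolchain.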