I am trying to find the actual difference between _mm256_xor_si256 and _mm256_xor_ps intrinsics from AVX(2).
They respectively map to the Intel instructions vpxor and vxorps.
Which are defined by Intel as:
dst[255:0] := (a[255:0] XOR b[255:0])
dst[MAX:256] := 0
versus
FOR j := 0 to 7
i := j*32
dst[i+31:i] := a[i+31:i] XOR b[i+31:i]
ENDFOR
dst[MAX:256] := 0
But frankly, I don't see a difference in their effects: they both XOR 256 bits. The latter, however, can be used on both AVX and AVX2, while the first requires AVX2. Why would you ever use the first, with its lower compatibility?
There is no difference in the effects: both do a bitwise XOR of 256 bits. But that doesn't mean there aren't differences; they are just less visible.
vxorps can, on Haswell, only go to port 5 (and therefore has a throughput of 1/cycle), but vpxor can go to ports 0, 1, and 5, and has a throughput of 3/cycle. Also, there is a bypass delay when a result generated in the floating-point domain is consumed by an instruction that executes in the integer domain, and vice versa. So using the "wrong" instruction can have slightly higher latency, which is why vxorps may be better in some contexts (but it's not as simple as "always when using floats").
I don't know for sure what AMD Excavator will do in that regard, but Bulldozer, Piledriver, and Steamroller all have such bypass delays, so I expect them in Excavator as well.