
Difference between _mm256_xor_si256() and _mm256_xor_ps()

I am trying to find the actual difference between _mm256_xor_si256 and _mm256_xor_ps intrinsics from AVX(2).

They map, respectively, to the Intel instructions:

  • vpxor ymm, ymm, ymm
  • vxorps ymm, ymm, ymm

Which are defined by Intel as:

dst[255:0] := (a[255:0] XOR b[255:0])
dst[MAX:256] := 0

versus

FOR j := 0 to 7
    i := j*32
    dst[i+31:i] := a[i+31:i] XOR b[i+31:i]
ENDFOR
dst[MAX:256] := 0

Frankly, I don't see a difference in their effects: both XOR 256 bits. But the latter can be used with both AVX and AVX2, while the former requires AVX2. Why would you ever use the former, given its lower compatibility?

Bram asked Dec 04 '25 13:12


1 Answer

There is no difference in the effects: both do a bitwise XOR of 256 bits. But that doesn't mean there are no differences; they are just less visible.
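To make the "no difference in the effects" point concrete, here is a minimal scalar sketch in plain C that models the two pseudocode definitions from the question. It uses no intrinsics; the `v256` type and the `model_*` function names are hypothetical, chosen only for illustration. One model XORs the whole 256-bit value at once (done byte by byte), the other follows the FOR loop over eight 32-bit lanes.

```c
#include <assert.h>   /* static_assert (C11) */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for a 256-bit register: eight 32-bit lanes. */
typedef struct { uint32_t lane[8]; } v256;
static_assert(sizeof(v256) == 32, "v256 must be exactly 256 bits");

/* Model of the vpxor definition: dst[255:0] := a[255:0] XOR b[255:0].
   The 256 bits are treated as one flat block, XORed byte by byte. */
static v256 model_xor_si256(v256 a, v256 b)
{
    unsigned char pa[32], pb[32];
    v256 dst;
    memcpy(pa, &a, sizeof pa);
    memcpy(pb, &b, sizeof pb);
    for (int i = 0; i < 32; i++)
        pa[i] ^= pb[i];
    memcpy(&dst, pa, sizeof dst);
    return dst;
}

/* Model of the vxorps definition: FOR j := 0 to 7, XOR each 32-bit lane. */
static v256 model_xor_ps(v256 a, v256 b)
{
    v256 dst;
    for (int j = 0; j < 8; j++)
        dst.lane[j] = a.lane[j] ^ b.lane[j];
    return dst;
}

/* Returns 1 if both models produce bit-identical results for a and b. */
static int models_agree(v256 a, v256 b)
{
    v256 x = model_xor_si256(a, b);
    v256 y = model_xor_ps(a, b);
    return memcmp(&x, &y, sizeof x) == 0;
}
```

Because XOR works on each bit independently, splitting the 256 bits into lanes of any width cannot change the result, so `models_agree` returns 1 for every input.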

vxorps can, on Haswell, only go to port 5 (and therefore has a throughput of 1 per cycle), but vpxor can go to ports 0, 1 and 5, and has a throughput of 3 per cycle. Also, there is a bypass delay when a result generated in the floating-point domain is used by an instruction that executes in the integer domain, and vice versa. So using the "wrong" instruction can have a slightly higher latency, which is why vxorps may be better in some contexts (but it's not as simple as "always when using floats").
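As a rough illustration of matching the XOR to the domain of the surrounding code, the sketch below uses `_mm256_xor_ps` on data that is otherwise handled by floating-point instructions and `_mm256_xor_si256` on integer data. The helper names (`negate8_ps`, `toggle8_epi32`, `xor_domain_demo`) are hypothetical, and the code assumes GCC or Clang on x86-64: the `target` attribute and `__builtin_cpu_supports` are compiler extensions, not part of the intrinsics API. Which variant is actually faster depends on the surrounding instructions and the microarchitecture, so treat this as a starting point to measure, not a rule.

```c
#include <immintrin.h>
#include <stdint.h>

/* Float data: flip the sign bit of eight floats with the
   floating-point XOR (vxorps). Requires only AVX. */
__attribute__((target("avx")))
static void negate8_ps(float *x)
{
    const __m256 sign = _mm256_set1_ps(-0.0f);   /* only the sign bit set in each lane */
    __m256 v = _mm256_loadu_ps(x);
    _mm256_storeu_ps(x, _mm256_xor_ps(v, sign));
}

/* Integer data: toggle the bits selected by mask in eight 32-bit
   integers with the integer XOR (vpxor). Requires AVX2. */
__attribute__((target("avx2")))
static void toggle8_epi32(uint32_t *x, uint32_t mask)
{
    const __m256i m = _mm256_set1_epi32((int)mask);
    __m256i v = _mm256_loadu_si256((const __m256i *)x);
    _mm256_storeu_si256((__m256i *)x, _mm256_xor_si256(v, m));
}

/* Returns 1 if both helpers behave as expected, 0 on a mismatch,
   and -1 if the CPU does not report AVX2 (nothing is executed). */
static int xor_domain_demo(void)
{
    if (!__builtin_cpu_supports("avx2"))
        return -1;

    float f[8] = {1.0f, -2.0f, 3.0f, -4.0f, 5.0f, -6.0f, 7.0f, -8.0f};
    uint32_t u[8] = {0u, 1u, 2u, 3u, 4u, 5u, 6u, 7u};

    negate8_ps(f);
    toggle8_epi32(u, 0xffu);

    for (int i = 0; i < 8; i++) {
        float expect_f = (i % 2 == 0) ? -(float)(i + 1) : (float)(i + 1);
        if (f[i] != expect_f)
            return 0;
        if (u[i] != ((uint32_t)i ^ 0xffu))
            return 0;
    }
    return 1;
}
```

If a value has to cross between the two views, `_mm256_castps_si256` and `_mm256_castsi256_ps` reinterpret the register without generating an instruction, so the choice of XOR variant stays independent of the C type the data happens to have.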

I don't know for sure what AMD Excavator will do in that regard, but Bulldozer, Piledriver and Steamroller have those bypass delays, so I expect them in Excavator as well.

harold answered Dec 08 '25 07:12


