I am trying to find the actual difference between _mm256_xor_si256 and _mm256_xor_ps intrinsics from AVX(2).
They respectively map to the Intel instructions vpxor and vxorps.
Which are defined by Intel as:
dst[255:0] := (a[255:0] XOR b[255:0])
dst[MAX:256] := 0
versus
FOR j := 0 to 7
i := j*32
dst[i+31:i] := a[i+31:i] XOR b[i+31:i]
ENDFOR
dst[MAX:256] := 0
But frankly, I don't see a difference in their effects: they both XOR 256 bits. The latter, however, can be used on both AVX and AVX2, while the first requires AVX2. Why would you ever use the first, with its lower compatibility?
There is no difference in the effects: both do a bitwise XOR of 256 bits. But that doesn't mean there aren't differences; they are just less visible.
vxorps can, on Haswell, only go to port 5 (and therefore has a throughput of 1/cycle), but vpxor can go to ports 0, 1, and 5, and has a throughput of 3/cycle. Also, there is a bypass delay when a result generated in the floating-point domain is consumed by an instruction that executes in the integer domain, and vice versa. So using the "wrong" instruction can have slightly higher latency, which is why vxorps may be better in some contexts (but it's not as simple as "always when using floats").
I don't know for sure what AMD Excavator will do in that regard, but Bulldozer, Piledriver, and Steamroller all have such bypass delays, so I expect them in Excavator as well.