PCMPGTQ was introduced in SSE4.2, and it provides a signed greater-than comparison for 64-bit numbers that yields a mask.
How does one support this functionality on instruction sets predating SSE4.2?
Update: The same question applies to ARMv7 with Neon, which also lacks a 64-bit comparator. The sister question is found here: What is the most efficient way to support CMGT with 64-bit signed comparisons on ARMv7a with Neon?
#include <emmintrin.h>  // SSE2 intrinsics

__m128i pcmpgtq_sse2(__m128i a, __m128i b) {
    // Equal high dwords: high 32 bits of (b - a) are all-ones iff a.lo > b.lo (unsigned).
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    // Differing high dwords: the signed 32-bit compare decides.
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    // Broadcast each qword's high-dword result to both of its dwords.
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3, 3, 1, 1));
}
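As a quick sanity check, the mask can be compared against a plain scalar compare. A minimal sketch (the harness and test values are mine, not part of the original; assumes an x86-64 target for _mm_cvtsi128_si64):

#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/* pcmpgtq_sse2 as defined above */

int main(void) {
    // Lane 0 crosses the sign boundary; lane 1 takes the equal-high-dword path.
    __m128i a = _mm_set_epi64x(3, 1);   // lanes {1, 3}
    __m128i b = _mm_set_epi64x(5, -1);  // lanes {-1, 5}
    __m128i m = pcmpgtq_sse2(a, b);
    int64_t lo = _mm_cvtsi128_si64(m);
    int64_t hi = _mm_cvtsi128_si64(_mm_unpackhi_epi64(m, m));
    printf("%lld %lld\n", (long long)lo, (long long)hi);  // expect: -1 0
    return 0;
}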
We have 32-bit signed comparison intrinsics, so split the packed qwords into dword pairs. If the high dword in a is greater than the high dword in b, then there is no need to compare the low dwords.
if (a.hi > b.hi) { r.hi = 0xFFFFFFFF; }
if (a.hi <= b.hi) { r.hi = 0x00000000; }
If the high dword in a is equal to the high dword in b, then a 64-bit subtract will either clear or set all 32 high bits of the result (when the high dwords are equal they "cancel" each other out, so the borrow from the low dwords is effectively an unsigned compare of the low dwords, placing the result in the high dwords).
if (a.hi == b.hi) { r = (b - a) & 0xFFFFFFFF00000000; }
Copy the comparison mask in the high 32 bits to the low 32 bits.
r.lo = r.hi
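Putting the three steps together, the same decomposition can be written as a scalar reference in plain C, which is handy when verifying the vector version (the function name is mine):

#include <stdint.h>

// Scalar reference: the same per-lane decomposition the vector code performs.
int64_t cmpgtq_scalar(int64_t a, int64_t b) {
    int32_t  ahi = (int32_t)(a >> 32), bhi = (int32_t)(b >> 32);
    uint32_t alo = (uint32_t)a,        blo = (uint32_t)b;
    if (ahi != bhi)
        return ahi > bhi ? -1 : 0;  // signed compare of the high dwords
    // Equal high dwords: unsigned compare of the low dwords. The vector code
    // gets this same bit from the high 32 bits of the 64-bit (b - a).
    return alo > blo ? -1 : 0;
}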
Updated: Here's the Godbolt for SSE2 and ARMv7+Neon.
I'm not sure if this is the most efficient output, but this is what Clang generates for x64. I've also taken the same implementation and converted it to support ARMv7 with Neon. See the Godbolt and the assembly below:
x64 SSE2
.LCPI0_0:
.quad 2147483648 # 0x80000000
.quad 2147483648 # 0x80000000
cmpgtq_sim(long __vector(2), long __vector(2)): # @cmpgtq_sim(long __vector(2), long __vector(2))
movdqa xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [2147483648,2147483648]
pxor xmm1, xmm2
pxor xmm0, xmm2
movdqa xmm2, xmm0
pcmpgtd xmm2, xmm1
pshufd xmm3, xmm2, 160 # xmm3 = xmm2[0,0,2,2]
pcmpeqd xmm0, xmm1
pshufd xmm1, xmm0, 245 # xmm1 = xmm0[1,1,3,3]
pand xmm1, xmm3
pshufd xmm0, xmm2, 245 # xmm0 = xmm2[1,1,3,3]
por xmm0, xmm1
ret
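Note that Clang's output corresponds to a slightly different formulation than the intrinsics at the top: instead of the 64-bit subtract it flips the sign bit of each low dword (the XOR with 0x80000000), so a signed 32-bit compare of the low dwords behaves like an unsigned one. A sketch of that variant in intrinsics, reconstructed from the assembly (the function name and reconstruction are mine):

#include <emmintrin.h>

__m128i cmpgtq_sim_sse2(__m128i a, __m128i b) {
    // Flip the sign bit of each qword's low dword so that a signed 32-bit
    // compare of the low dwords acts like an unsigned compare.
    const __m128i bias = _mm_set_epi32(0, (int)0x80000000, 0, (int)0x80000000);
    __m128i ax = _mm_xor_si128(a, bias);
    __m128i bx = _mm_xor_si128(b, bias);
    __m128i gt = _mm_cmpgt_epi32(ax, bx);  // per-dword signed >
    __m128i eq = _mm_cmpeq_epi32(ax, bx);  // per-dword ==
    __m128i gtlo = _mm_shuffle_epi32(gt, _MM_SHUFFLE(2, 2, 0, 0));  // low-dword results
    __m128i eqhi = _mm_shuffle_epi32(eq, _MM_SHUFFLE(3, 3, 1, 1));  // high-dword results
    __m128i gthi = _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));  // high-dword results
    // a > b  <=>  a.hi > b.hi, or a.hi == b.hi and a.lo > b.lo (unsigned)
    return _mm_or_si128(gthi, _mm_and_si128(eqhi, gtlo));
}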
ARMv7+Neon
cmpgtq_sim(__simd128_int64_t, __simd128_int64_t):
vldr d16, .L3
vldr d17, .L3+8
veor q0, q0, q8
veor q1, q1, q8
vcgt.s32 q8, q0, q1
vceq.i32 q1, q0, q1
vmov q9, q8 @ v4si
vmov q10, q1 @ v4si
vtrn.32 q9, q8
vmov q0, q8 @ v4si
vmov q8, q1 @ v4si
vtrn.32 q10, q8
vand q8, q8, q9
vorr q0, q8, q0
bx lr
.L3:
.word -2147483648
.word 0
.word -2147483648
.word 0
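For completeness, the SSE2 intrinsics at the top translate almost one-for-one to ARMv7 Neon. A minimal sketch (the function name is mine; assumes a Neon-enabled build, e.g. -mfpu=neon):

#include <arm_neon.h>

int64x2_t cmpgtq_neon(int64x2_t a, int64x2_t b) {
    int32x4_t a32 = vreinterpretq_s32_s64(a);
    int32x4_t b32 = vreinterpretq_s32_s64(b);
    // Equal high dwords: high 32 bits of (b - a) are all-ones iff a.lo > b.lo.
    uint32x4_t r = vandq_u32(vceqq_s32(a32, b32),
                             vreinterpretq_u32_s64(vsubq_s64(b, a)));
    // Differing high dwords: the signed 32-bit compare decides.
    r = vorrq_u32(r, vcgtq_s32(a32, b32));
    // Broadcast each qword's high-dword result to both of its dwords
    // (the Neon equivalent of _mm_shuffle_epi32(r, _MM_SHUFFLE(3,3,1,1))).
    uint32x2_t lo = vdup_lane_u32(vget_low_u32(r), 1);
    uint32x2_t hi = vdup_lane_u32(vget_high_u32(r), 1);
    return vreinterpretq_s64_u32(vcombine_u32(lo, hi));
}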