How to simulate pcmpgtq on sse2?

PCMPGTQ was introduced in SSE4.2. It performs a signed greater-than comparison of packed 64-bit integers and yields a mask: each 64-bit lane of the result is all ones where the comparison is true and all zeros otherwise.

How can this functionality be supported on instruction sets predating SSE4.2?

Update: The same question applies to ARMv7 with Neon, which also lacks a 64-bit comparison. The sister question is: What is the most efficient way to support CMGT with 64bit signed comparisons on ARMv7a with Neon?

Dan Weber asked Dec 06 '20 08:12


2 Answers

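#include <emmintrin.h>  // SSE2 intrinsics

__m128i pcmpgtq_sse2 (__m128i a, __m128i b) {
    // High dwords equal: the high dword of b - a is all ones exactly when a.lo > b.lo (unsigned).
    __m128i r = _mm_and_si128(_mm_cmpeq_epi32(a, b), _mm_sub_epi64(b, a));
    // High dwords differ: the signed 32-bit compare of the high dwords decides.
    r = _mm_or_si128(r, _mm_cmpgt_epi32(a, b));
    // Copy each high dword (the mask) over the low dword of the same qword.
    return _mm_shuffle_epi32(r, _MM_SHUFFLE(3,3,1,1));
}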

We have 32-bit signed comparison intrinsics, so split the packed qwords into dword pairs.

If the high dword in a is greater than the high dword in b then there is no need to compare the low dwords.

if (a.hi > b.hi) { r.hi = 0xFFFFFFFF; }
if (a.hi < b.hi) { r.hi = 0x00000000; }

If the high dword in a is equal to the high dword in b, then a 64-bit subtract will either clear or set all 32 high bits of the result. The equal high dwords "cancel" each other out, so only the borrow out of the low dwords reaches the high dword. This is effectively an unsigned compare of the low dwords, with the result placed in the high dword.

if (a.hi == b.hi) { r = (b - a) & 0xFFFFFFFF00000000; }
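
The borrow behaviour can be checked in plain scalar C. The following is only a minimal sketch for illustration; the helper name diff_high_dword is made up here and is not part of the answer's code.

```c
#include <stdint.h>

/* Hypothetical helper: the high 32 bits of the 64-bit difference b - a.
   Unsigned arithmetic wraps, so a borrow out of the low dword shows up
   as 0xFFFFFFFF in the high dword when the high dwords are equal. */
static uint32_t diff_high_dword(uint64_t a, uint64_t b) {
    return (uint32_t)((b - a) >> 32);
}
```

With equal high dwords, diff_high_dword(0x0000000700000005, 0x0000000700000003) gives 0xFFFFFFFF because a.lo > b.lo, and swapping the arguments gives 0. The compare is unsigned: a.lo = 0x80000000 still counts as greater than b.lo = 1.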

Copy the comparison mask in the high 32 bits to the low 32 bits.

r.lo = r.hi
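
The steps above can be put together as a scalar model of a single 64-bit lane, which is handy for checking the logic without any SIMD hardware. This is a sketch under assumptions: the name cmpgt64_lane_model is hypothetical, and the conversion of the high dword to int32_t relies on the two's complement wraparound that mainstream compilers provide.

```c
#include <stdint.h>

/* Hypothetical scalar model of one lane: returns all ones if a > b
   (signed 64-bit), otherwise zero, using only 32-bit compares and a
   64-bit subtract, as in the steps described above. */
static uint64_t cmpgt64_lane_model(int64_t a, int64_t b) {
    uint64_t ua = (uint64_t)a, ub = (uint64_t)b;

    /* Signed view of the high dwords (assumes two's complement conversion). */
    int32_t a_hi = (int32_t)(uint32_t)(ua >> 32);
    int32_t b_hi = (int32_t)(uint32_t)(ub >> 32);

    uint32_t eq_hi   = (a_hi == b_hi) ? 0xFFFFFFFFu : 0u;  /* cmpeq on high dwords */
    uint32_t gt_hi   = (a_hi >  b_hi) ? 0xFFFFFFFFu : 0u;  /* cmpgt on high dwords */
    uint32_t diff_hi = (uint32_t)((ub - ua) >> 32);        /* high dword of b - a  */

    /* r.hi = (eq & diff) | gt, then r.lo = r.hi */
    uint32_t r_hi = (eq_hi & diff_hi) | gt_hi;
    return ((uint64_t)r_hi << 32) | r_hi;
}
```

Comparing this model against a plain `a > b` over edge cases such as INT64_MIN, INT64_MAX, and values that differ only in the low dword is a quick way to sanity-check the approach.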

Updated: Here's the Godbolt for SSE2 and ARMv7+Neon.

aqrit answered Sep 21 '22 06:09


I'm not sure whether this is optimal, but this is Clang's output for x64. I've also taken the same implementation and converted it to support ARMv7 with Neon.

See the Godbolt and the assembly below:

.LCPI0_0:
        .quad   2147483648                      # 0x80000000
        .quad   2147483648                      # 0x80000000
cmpgtq_sim(long __vector(2), long __vector(2)):                  # @cmpgtq_sim(long __vector(2), long __vector(2))
        movdqa  xmm2, xmmword ptr [rip + .LCPI0_0] # xmm2 = [2147483648,2147483648]
        pxor    xmm1, xmm2
        pxor    xmm0, xmm2
        movdqa  xmm2, xmm0
        pcmpgtd xmm2, xmm1
        pshufd  xmm3, xmm2, 160                 # xmm3 = xmm2[0,0,2,2]
        pcmpeqd xmm0, xmm1
        pshufd  xmm1, xmm0, 245                 # xmm1 = xmm0[1,1,3,3]
        pand    xmm1, xmm3
        pshufd  xmm0, xmm2, 245                 # xmm0 = xmm2[1,1,3,3]
        por     xmm0, xmm1
        ret
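
For readers who prefer intrinsics, the x64 sequence above can be read as the following SSE2 code. This is a reconstruction from the assembly, not the source that was actually compiled, and the function name cmpgtq_bias_sse2 is hypothetical. The idea is to flip the sign bit of each low dword so that a signed 32-bit compare orders the low dwords as unsigned values.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sketch reconstructed from the assembly above; the name is an assumption. */
static __m128i cmpgtq_bias_sse2(__m128i a, __m128i b) {
    /* 0x80000000 in the low dword of each qword (pxor with .LCPI0_0). */
    const __m128i bias = _mm_set_epi32(0, INT32_MIN, 0, INT32_MIN);
    a = _mm_xor_si128(a, bias);
    b = _mm_xor_si128(b, bias);

    __m128i gt = _mm_cmpgt_epi32(a, b);  /* pcmpgtd */
    __m128i eq = _mm_cmpeq_epi32(a, b);  /* pcmpeqd */

    /* Broadcast low-dword "greater" and high-dword "equal"/"greater" masks. */
    __m128i gt_lo = _mm_shuffle_epi32(gt, _MM_SHUFFLE(2, 2, 0, 0));  /* imm 160 */
    __m128i eq_hi = _mm_shuffle_epi32(eq, _MM_SHUFFLE(3, 3, 1, 1));  /* imm 245 */
    __m128i gt_hi = _mm_shuffle_epi32(gt, _MM_SHUFFLE(3, 3, 1, 1));  /* imm 245 */

    /* a > b  <=>  a.hi > b.hi, or a.hi == b.hi and a.lo > b.lo (unsigned). */
    return _mm_or_si128(gt_hi, _mm_and_si128(eq_hi, gt_lo));
}
```

Compared with the subtract-based version, this form uses one constant load and three shuffles but avoids the 64-bit subtract; which is faster likely depends on the target microarchitecture.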

ARMv7+Neon

cmpgtq_sim(__simd128_int64_t, __simd128_int64_t):
        vldr    d16, .L3
        vldr    d17, .L3+8
        veor    q0, q0, q8
        veor    q1, q1, q8
        vcgt.s32        q8, q0, q1
        vceq.i32        q1, q0, q1
        vmov    q9, q8  @ v4si
        vmov    q10, q1  @ v4si
        vtrn.32 q9, q8
        vmov    q0, q8  @ v4si
        vmov    q8, q1  @ v4si
        vtrn.32 q10, q8
        vand    q8, q8, q9
        vorr    q0, q8, q0
        bx      lr
.L3:
        .word   -2147483648
        .word   0
        .word   -2147483648
        .word   0

Dan Weber answered Sep 22 '22 06:09