Clang generates worse code for 7 comparisons than for 8 comparisons

Tags: intel

I was intrigued by clang's ability to convert many == comparisons of small integers into one big SIMD instruction, but then I noticed something strange: clang generated "worse" code (in my amateur evaluation) when I had 7 comparisons than when I had 8 comparisons.

bool f1(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42) | (x==47);
}

bool f2(short x){
    return (x==-1) | (x == 150) |
           (x==5) | (x==64) | 
           (x==15) | (x==223) | 
           (x==42);
}

My question: is this a small performance bug, or does clang have a very good reason for not introducing a dummy comparison (i.e. pretending there is one extra comparison against one of the 7 values) and using one more constant in the code to achieve it?

godbolt link here:

# clang(trunk) -O2 -march=haswell
f1(short):
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]  # 16 bytes = 8 shorts
    vpacksswb       xmm0, xmm0, xmm0
    vpmovmskb       eax, xmm0
    test    al, al
    setne   al           # booleanize the parallel-compare bitmask
    ret

vs.

f2(short):
    cmp     di, -1
    sete    r8b
    cmp     edi, 150
    sete    dl
    cmp     di, 5             # scalar checks of 3 conditions
    vmovd   xmm0, edi
    vpbroadcastw    xmm0, xmm0
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI1_0]  # low 8 bytes = 4 shorts
    sete    al
    vpmovsxwd       xmm0, xmm0
    vmovmskps       esi, xmm0
    test    sil, sil
    setne   cl                # SIMD check of the other 4
    or      al, r8b
    or      al, dl
    or      al, cl            # and combine.
    ret

quick-bench does not seem to work because I don't know how to pass the -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows this is obviously worse for throughput, and also latency.)

asked Sep 23 '19 by NoSenseEtAl

1 Answer

It looks like clang's optimizer didn't think of duplicating an element to bring the count up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. This is clearly a missed optimization that should be reported as a clang/LLVM optimizer bug: https://bugs.llvm.org/


The asm for f1() is clearly better than f2(): vpacksswb xmm has the same cost as vpmovsxwd xmm on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything, vpmovsx -> vmovmskps could have bypass latency between the integer and FP domains (footnote 1).


Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.

You don't need AVX2, but a word broadcast in one instruction, without a pshufb control vector, does make it more efficient. (clang chooses pshuflw -> pshufd for -march=nehalem.)


Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.

Instead of test al, al, it's possible to select which bits you want to check with test sil, 0b00001010 for example, to check bits 1 and 3 but ignore non-zero bits in other positions.

pcmpeqw sets both bytes the same inside a word element so it's fine to pmovmskb that result and get an integer with pairs of bits.
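
For example, here's a minimal SSE2 demo of that (my sketch, not from the original post; the values 42 and the table contents are arbitrary) showing a matching word lane becoming a pair of set bits in the pmovmskb result:

#include <emmintrin.h>   // SSE2
#include <cstdio>

int main() {
    __m128i v   = _mm_set1_epi16(42);                        // broadcast x = 42
    __m128i tbl = _mm_setr_epi16(42, 0, 42, 0, 0, 0, 0, 0);  // lanes 0 and 2 match
    __m128i eq  = _mm_cmpeq_epi16(v, tbl);                   // all-ones in matching word lanes
    printf("%#06x\n", _mm_movemask_epi8(eq));                // prints 0x0033: set bits come in pairs
}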

There's also zero benefit to using a byte register instead of a dword register: test sil,sil needs a REX prefix, so clang should use test esi,esi instead.

So even without duplicating one of the conditions, f2() could be:

f2:
    vmovd           xmm0, edi
    vpbroadcastw    xmm0, xmm0             # set1(x)
    vpcmpeqw        xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
    vpmovmskb       eax, xmm0
    test    eax, 0b0011111111111111    # (1<<14) - 1 = low 14 bits set
    setne   al
    ret

That test will set ZF according to the low 14 bits of the pmovmskb result, because the higher bits are cleared in the TEST mask. TEST is an AND that doesn't write its output; it's often useful for selecting parts of a compare mask.

But since we need a 16-byte constant in memory in the first place, yes we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax,eax like a normal person. Compressing the mask to fit in 8-bit AL is a total waste of time and code-size. test r32, r32 is just as fast as test r8,r8 and doesn't need a REX prefix for SIL, DIL, or BPL.
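
In intrinsics, the padded version might look like this (a sketch: the function name f2_padded and the choice to duplicate 42 are mine, and I'm assuming -mavx2 so _mm_set1_epi16 compiles to movd + vpbroadcastw):

#include <immintrin.h>

bool f2_padded(short x) {
    // 7 real candidates, with 42 duplicated to pad the table to 8 words
    const __m128i tbl = _mm_setr_epi16(-1, 150, 5, 64, 15, 223, 42, 42);
    __m128i v  = _mm_set1_epi16(x);          // movd + vpbroadcastw with AVX2
    __m128i eq = _mm_cmpeq_epi16(v, tbl);    // 8 parallel word compares
    return _mm_movemask_epi8(eq) != 0;       // should compile to vpmovmskb / test eax,eax / setne al
}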

Fun fact: AVX512BW + AVX512VL would let us use vpbroadcastw xmm0, edi to combine the movd and broadcast.


Or, to compare only 4 elements, we only need SSE2 here, with no extra shuffling for movmskps. And here using a mask truly is useful.

test_4_possibilities_SSE2:
    movd            xmm0, edi
    pshufd          xmm0, xmm0, 0             # set1_epi32(x)
    pcmpeqw         xmm0, [const]             # == set_epi32(a, b, c, d)
    pmovmskb        eax, xmm0
    test    eax, 0b0001000100010001     # the low bit of each group of 4
    setne   al
    ret

We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test lets us do that more cheaply than any extra instruction would.

Without AVX2, a SIMD dword broadcast with pshufd is cheaper than needing a word broadcast.
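
As a sketch in intrinsics (SSE2 only; the function name and the 4 example candidates are mine):

#include <emmintrin.h>   // SSE2

bool f_4way(short x) {
    const __m128i tbl = _mm_setr_epi32(-1, 150, 5, 64);  // candidate in the low word of each dword
    __m128i v  = _mm_set1_epi32(x);          // movd + pshufd 0: x in the low word of each dword
    __m128i eq = _mm_cmpeq_epi16(v, tbl);    // word compares; high-word results are don't-care
    return (_mm_movemask_epi8(eq) & 0b0001000100010001) != 0;  // low bit of each group of 4
}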

Another option is to imul with 0x00010001 to broadcast a word into a 32-bit register, but that has 3-cycle latency, so it's potentially worse than punpcklwd -> pshufd.
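
The scalar side of that trick, as a sketch (multiplying by 0x00010001 copies the low 16 bits into the high 16 bits):

unsigned broadcast_word(unsigned short w) {
    return w * 0x00010001u;   // one imul, e.g. 0x002A -> 0x002A002A; 3c latency
}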

Inside a loop, though, it would be worth loading a control vector for pshufb (SSSE3) instead of using 2 shuffles or an imul.
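
For instance (my sketch; the control constant 0x0100 repeats the byte pattern 0,1 so pshufb copies bytes 0-1 of the source into every word lane):

#include <tmmintrin.h>   // SSSE3

__m128i broadcast_word_ssse3(short x, __m128i ctrl) {
    // hoist ctrl = _mm_set1_epi16(0x0100) out of the loop; then each
    // broadcast is just movd + one pshufb
    return _mm_shuffle_epi8(_mm_cvtsi32_si128((unsigned short)x), ctrl);
}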

answered Oct 27 '22 by Peter Cordes