I was intrigued by clang's ability to convert many == comparisons of small integers into one big SIMD instruction, but then I noticed something strange. Clang generated "worse" code (in my amateur evaluation) when I had 7 comparisons than when I had 8 comparisons.
bool f1(short x){
    return (x==-1) | (x == 150) |
           (x==5)  | (x==64)  |
           (x==15) | (x==223) |
           (x==42) | (x==47);
}

bool f2(short x){
    return (x==-1) | (x == 150) |
           (x==5)  | (x==64)  |
           (x==15) | (x==223) |
           (x==42);
}
My question: is this a small performance bug, or does clang have a very good reason for not introducing a dummy comparison (i.e. pretending there is one extra comparison against one of the 7 values) and spending one more constant in the code to achieve it?
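To be concrete, by a dummy comparison I mean something like the following sketch, where the 7-way version is padded back up to 8 tests by repeating one of the existing values (42, chosen arbitrarily). I haven't checked whether clang actually emits the f1-style code for this:

bool f2_padded(short x){
    return (x==-1) | (x == 150) |
           (x==5)  | (x==64)  |
           (x==15) | (x==223) |
           (x==42) | (x==42);   // last compare just repeats 42 as padding
}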
godbolt link here:
# clang(trunk) -O2 -march=haswell
f1(short):
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0] # 16 bytes = 8 shorts
vpacksswb xmm0, xmm0, xmm0
vpmovmskb eax, xmm0
test al, al
setne al # booleanize the parallel-compare bitmask
ret
vs.
f2(short):
cmp di, -1
sete r8b
cmp edi, 150
sete dl
cmp di, 5 # scalar checks of 3 conditions
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI1_0] # low 8 bytes = 4 shorts
sete al
vpmovsxwd xmm0, xmm0
vmovmskps esi, xmm0
test sil, sil
setne cl # SIMD check of the other 4
or al, r8b
or al, dl
or al, cl # and combine.
ret
QuickBench does not seem to work because I don't know how to pass the -mavx2 flag to it. (Editor's note: simply counting uops for front-end cost shows f2 is obviously worse for throughput, and also for latency.)
It looks like clang's optimizer didn't think of duplicating an element to bring it up to a SIMD-convenient number of comparisons. But you're right, that would be better than doing extra scalar work. Clearly a missed optimization which should get reported as a clang/LLVM optimizer bug. https://bugs.llvm.org/
The asm for f1() is clearly better than f2(): vpacksswb xmm has the same cost as vpmovsxwd xmm on mainstream Intel and AMD CPUs, like other single-uop shuffles. And if anything, vpmovsx -> vmovmskps could have bypass latency between the integer and FP domains (see footnote 1).
Footnote 1: Probably no extra bypass latency on mainstream Intel CPUs with AVX2 (Sandybridge-family); integer shuffles between FP ops are typically fine, IIRC. (https://agner.org/optimize/). But for an SSE4.1 version on Nehalem, yes there might be an extra penalty the integer version wouldn't have.
You don't need AVX2, but a word broadcast in one instruction without a pshufb control vector does make it more efficient. And clang chooses pshuflw -> pshufd for -march=nehalem.
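For reference, a non-AVX2 word broadcast in intrinsics might look like this sketch of that pshuflw -> pshufd sequence (the helper name is mine, not something clang emits):

#include <immintrin.h>

// Sketch: word broadcast without AVX2, i.e. pshuflw -> pshufd (SSE2 only).
static inline __m128i broadcast_word_sse2(short x) {
    __m128i v = _mm_cvtsi32_si128((unsigned short)x);  // movd
    v = _mm_shufflelo_epi16(v, 0);                      // pshuflw: word 0 -> low 4 words
    return _mm_shuffle_epi32(v, 0);                     // pshufd: dword 0 -> all 4 dwords
}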
Of course, both versions are sub-optimal. There's no need to shuffle to compress the compare result before movemask.
Instead of test al, al, it's possible to select which bits you want to check, with test sil, 0b00001010 for example, to check bits 1 and 3 but ignore non-zero bits in other positions.
pcmpeqw sets both bytes the same inside a word element, so it's fine to pmovmskb that result and get an integer with pairs of bits.
There's also zero benefit to using a byte register instead of a dword register: test sil, sil needs a REX prefix, so it should be test esi, esi instead.
So even without duplicating one of the conditions, f2() could be:
f2:
vmovd xmm0, edi
vpbroadcastw xmm0, xmm0 # set1(x)
vpcmpeqw xmm0, xmm0, xmmword ptr [rip + .LCPI0_0]
vpmovmskb eax, xmm0
test eax, 0b0011111111111111 # (1<<14) - 1 = low 14 bits set (7 word compares)
setne al
ret
That test will set ZF according to the low 14 bits of the pmovmskb result, because the higher bits are cleared by the TEST mask. TEST is an AND that doesn't write its output, which is often useful for selecting parts of a compare mask.
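In intrinsics, that masked-TEST approach might look roughly like this (a sketch, not what clang currently emits; the unused 8th constant and the function name are made up):

#include <immintrin.h>

bool f2_no_shuffle(short x) {
    __m128i v = _mm_set1_epi16(x);                    // vmovd + vpbroadcastw with AVX2
    __m128i consts = _mm_setr_epi16(-1, 150, 5, 64,
                                    15, 223, 42, 0);  // 8th lane is a don't-care
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi16(v, consts));
    return (mask & 0x3FFF) != 0;                      // low 14 bits = the 7 real compares
}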
But since we need a 16-byte constant in memory in the first place, yes, we should duplicate one of the elements to pad it up to 8 elements. Then we can use test eax, eax like a normal person. Compressing the mask to fit in 8-bit AL is a total waste of time and code-size. test r32, r32 is just as fast as test r8, r8, and doesn't need a REX prefix for SIL, DIL, or BPL.
Fun fact: AVX512VL would let us use vpbroadcastw xmm0, edi to combine the movd and the broadcast.
Or, to compare only 4 elements, we only need SSE2 here instead of doing extra shuffling for movmskps. And using a mask for test really is useful:
test_4_possibilities_SSE2:
movd xmm0, edi
pshufd xmm0, xmm0, 0 # set1_epi32(x)
pcmpeqw xmm0, [const] # == set_epi32(a, b, c, d)
pmovmskb eax, xmm0
test eax, 0b0001000100010001 # the low bit of each group of 4
setne al
ret
We do a dword broadcast and ignore the compare result in the high 16 bits of each 32-bit element. Using a mask for test lets us do that more cheaply than any extra instruction would.
Without AVX2, a SIMD dword broadcast with pshufd is cheaper than needing a word broadcast.
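In intrinsics, the 4-constant SSE2 version might look like this sketch (a, b, c, d stand in for any four of the values; the high word of each dword element is a don't-care, masked off by the 0x1111 test):

#include <immintrin.h>

bool test_4_possibilities(short x, short a, short b, short c, short d) {
    __m128i v = _mm_set1_epi32(x);                // pshufd dword broadcast of x
    __m128i consts = _mm_setr_epi32(a, b, c, d);  // only the low word of each dword matters
    int mask = _mm_movemask_epi8(_mm_cmpeq_epi16(v, consts));
    return (mask & 0x1111) != 0;                  // low bit of each group of 4
}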
Another option is to imul by 0x00010001 to broadcast a word into a 32-bit register, but that has 3-cycle latency so it's potentially worse than punpcklwd -> pshufd.
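That scalar trick, as a sketch in C (the helper name is just for illustration):

// Multiplying a zero-extended 16-bit value by 0x00010001 replicates it into
// both halves of a 32-bit register, e.g. 0x1234 -> 0x12341234.
static inline unsigned broadcast_word_scalar(unsigned short x) {
    return x * 0x00010001u;
}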
Inside a loop, though, it would be worth loading a control vector for pshufb (SSSE3) instead of using 2 shuffles or an imul.