Let's say you want to find the first occurrence of a value1 in a sorted array. For small arrays (where things like binary search don't pay off), you can achieve this by simply counting the number of values less than that value: the result is the index you are after.
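In C, that counting approach is just the following (a sketch with an illustrative function name; unsigned elements here to match the unsigned-compare asm below):

#include <stddef.h>

// Sketch: the number of elements strictly less than `value` is the
// lower-bound index in a sorted array. (Name is illustrative only.)
size_t lower_bound_index(const unsigned *arr, size_t size, unsigned value)
{
    size_t count = 0;
    for (size_t i = 0; i < size; i++)
        count += arr[i] < value;   // no branch on the comparison result
    return count;
}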
In x86 you can use adc (add with carry) for an efficient branch-free2 implementation of that approach (with the start pointer in rdi, the length in rsi, and the value to search for in edx):
xor eax, eax
lea rdi, [rdi + rsi*4] ; pointer to end of array = base + length
neg rsi ; we loop from -length to zero
loop:
cmp [rdi + 4 * rsi], edx
adc rax, 0 ; only a single uop on Sandybridge-family even before BDW
inc rsi
jnz loop
The answer ends up in rax. If you unroll that (or if you have a fixed, known input size), only the cmp; adc pair of instructions gets repeated, so the overhead approaches 2 simple instructions per comparison (and the sometimes-fused load). (See Which Intel microarchitecture introduced the ADC reg,0 single-uop special case? for background on that adc rax, 0 peephole.)
However, this only works for unsigned comparisons, where the carry flag holds the result of the comparison. Is there any equivalently efficient sequence for counting signed comparisons? Unfortunately, there doesn't seem to be an "add 1 if less than" instruction: adc, sbb and the carry flag are special in that respect.
I am interested in the general case where the elements have no specific order, and also in this sorted case, in case the sortedness assumption leads to a simpler or faster implementation.
1 Or, if the value doesn't exist, the first greater value. I.e., this is the so-called "lower bound" search.
2 Branch-free approaches necessarily do the same amount of work each time (in this case, examining the entire array), so this approach only makes sense when the arrays are small and the cost of a branch misprediction is therefore large relative to the total search time.
PCMPGT + PADDD or PSUBD is probably a really good idea for most CPUs, even for small sizes, maybe with a simple scalar cleanup. Or even just purely scalar, using movd loads; see below.
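For example, a full-width SSE2 version of that PCMPGT/PSUB idea could look roughly like this with intrinsics (a sketch that assumes size is a multiple of 4; a real version needs the scalar cleanup for the tail, and the function name is illustrative):

#include <emmintrin.h>  // SSE2
#include <stddef.h>

// Sketch: count signed arr[i] < key, 4 elements at a time.
size_t count_lt_sse2(const int *arr, size_t size, int key)
{
    __m128i vkey = _mm_set1_epi32(key);
    __m128i vcnt = _mm_setzero_si128();
    for (size_t i = 0; i < size; i += 4) {
        __m128i v  = _mm_loadu_si128((const __m128i *)(arr + i));
        __m128i lt = _mm_cmpgt_epi32(vkey, v);   // key > arr[i]: 0 or -1 per lane
        vcnt = _mm_sub_epi32(vcnt, lt);          // counter -= -1, i.e. += 1
    }
    // horizontal sum of the four per-lane counters
    vcnt = _mm_add_epi32(vcnt, _mm_shuffle_epi32(vcnt, _MM_SHUFFLE(1, 0, 3, 2)));
    vcnt = _mm_add_epi32(vcnt, _mm_shuffle_epi32(vcnt, _MM_SHUFFLE(2, 3, 0, 1)));
    return (size_t)(unsigned)_mm_cvtsi128_si32(vcnt);
}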
For scalar integer, avoiding XMM regs, use SETCC to create a 0/1 integer from any flag condition you want. Xor-zero a tmp register (potentially outside the loop) and SETCC into the low 8 bits of it, if you want to use 32- or 64-bit ADD instructions instead of only 8-bit.
cmp/adc reg,0 is basically a peephole optimization for the below / carry-set condition. AFAIK, there is nothing as efficient for signed-compare conditions: at best 3 uops for cmp/setcc/add, vs. 2 for cmp/adc. So unrolling to hide loop overhead is even more important.
See the bottom section of What is the best way to set a register to zero in x86 assembly: xor, mov or and? for more details about how to zero-extend SETCC r/m8 efficiently without causing partial-register stalls. And see Why doesn't GCC use partial registers? for a reminder of partial-register behaviour across uarches.
Yes, CF is special for a lot of things. It's the only condition flag that has set/clear/complement (stc/clc/cmc) instructions1. There's a reason that bt/bts/etc. instructions set CF, and that shift instructions shift into it. And yes, ADC/SBB can add/sub it directly into another register, unlike any other flag.
OF can be read similarly with ADOX (Intel since Broadwell, AMD since Ryzen), but that still doesn't help us because it's strictly OF, not the SF!=OF signed-less-than condition.
This is typical for most ISAs, not just x86. (AVR and some others can set/clear any condition flag because they have an instruction that takes an immediate bit-position in the status register. But they still only have ADC/SBB for directly adding the carry flag to an integer register.)
ARM 32-bit can do a predicated addlt r0, r0, #1 using any condition-code, including signed less-than, instead of an add-with-carry with immediate 0. ARM does have ADC-immediate which you could use for the C flag here, but not in Thumb mode (where it would be useful to avoid an IT instruction to predicate an ADD), so you'd need a zeroed register.
AArch64 can do a few predicated things, including increment with cinc with arbitrary condition predicates.
But x86 can't. We only have cmovcc and setcc to turn conditions other than CF==1 into integers. (Or, with ADOX, for OF==1.)
Footnote 1: Some status flags in EFLAGS, like interrupts IF (sti/cli), direction DF (std/cld), and alignment-check (stac/clac), have set/clear instructions, but not the condition flags ZF/SF/OF/PF or the BCD-carry AF.
cmp [rdi + 4 * rsi], edx will un-laminate even on Haswell/Skylake because of the indexed addressing mode, since it doesn't have a read/write destination register (so it's not like add reg, [mem]).
If tuning only for Sandybridge-family, you might as well just increment a pointer and decrement the size counter. Although this does save back-end (unfused-domain) uops for RS-size effects.
In practice you'd want to unroll with a pointer increment.
You mentioned sizes from 0 to 32, so we need to skip the loop if RSI = 0. The code in your question is just a do{}while, which doesn't do that. NEG sets flags according to the result, so we can JZ on that. You'd hope that it could macro-fuse because NEG is exactly like SUB from 0, but according to Agner Fog it doesn't on SnB/IvB. So that costs us another uop in the startup if you really do need to handle size=0.
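In C terms, the loop shape being described is roughly this (a sketch only, not compiler output):

#include <stddef.h>

// Rough C equivalent of the loop structure: a do{}while body indexed from
// -size up to 0, with an up-front size==0 check.
size_t count_signed_lt_c(const int *arr, size_t size, int key)
{
    size_t count = 0;
    if (size != 0) {                      // the JZ after NEG handles this in asm
        const int *end = arr + size;      // lea rdi, [rdi + rsi*4]
        ptrdiff_t i = -(ptrdiff_t)size;   // neg rsi
        do {
            count += end[i] < key;        // cmp / setl / add
        } while (++i != 0);               // inc rsi / jnz
    }
    return count;
}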
The standard way to implement integer += (a < b) or any other flag condition is what compilers do (Godbolt):
xor edx,edx ; can be hoisted out of a short-running loop, but compilers never do that
; but an interrupt-handler will destroy the rdx=dl status
cmp/test/whatever ; flag-setting code here
setcc dl ; zero-extended to a full register because of earlier xor-zeroing
add eax, edx
Sometimes compilers (especially gcc) will use setcc dl / movzx edx,dl, which puts the MOVZX on the critical path. This is bad for latency, and mov-elimination doesn't work on Intel CPUs when they use (part of) the same register for both operands.
For small arrays, if you don't mind having only an 8-bit counter, you could just use 8-bit add so you don't have to worry about zero-extension inside the loop.
; slower than cmp/adc: 5 uops per iteration so you'll definitely want to unroll.
; requires size<256 or the count will wrap
; use the add eax,edx version if you need to support larger size
count_signed_lt: ; (int *arr, size_t size, int key)
xor eax, eax
mov ecx, edx ; copy the key to ECX: DL is the SETCC destination below, so it must not alias the key
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
jz .return ; if(-size == 0) return 0;
; xor edx, edx ; tmp destination for SETCC
.loop:
cmp [rdi + 4 * rsi], ecx
setl dl ; false dependency on old RDX on CPUs other than P6-family
add al, dl
; add eax, edx ; boolean condition zero-extended into RDX if it was xor-zeroed
inc rsi
jnz .loop
.return:
ret
Alternatively using CMOV, making the loop-carried dep chain 2 cycles long (or 3 cycles on Intel before Broadwell, where CMOV is 2 uops):
;; 3 uops without any partial-register shenanigans, (or 4 because of unlamination)
;; but creates a 2 cycle loop-carried dep chain
cmp [rdi + 4 * rsi], edx
lea ecx, [rax + 1] ; tmp = count+1
cmovl eax, ecx ; count = arr[i]<key ? count+1 : count
So at best (with loop unrolling and a pointer-increment allowing cmp to micro-fuse) this takes 3 uops per element instead of 2.
SETCC is a single uop, so this is 5 fused-domain uops inside the loop. That's much worse on Sandybridge/IvyBridge, and still runs at worse than 1 per clock on later SnB-family. (Some ancient CPUs had slow setcc, like Pentium 4, but it's efficient on everything we still care about.)
When unrolling, if you want this to run faster than 1 cmp per clock, you have two choices: use separate registers for each setcc destination, creating multiple dep chains for the false dependencies, or use one xor edx,edx inside the loop to break the loop-carried false dependency into multiple short dep chains that only couple the setcc results of nearby loads (probably coming from the same cache line). You'll also need multiple accumulators because add latency is 1c.
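The multiple-accumulator part of that looks like this in C (a sketch assuming size is a multiple of 4; a compiler won't necessarily give each accumulator its own setcc temp register, so hand-written asm still has the edge there):

#include <stddef.h>

// Sketch: four independent accumulators so the 1-cycle add latency
// doesn't serialize the whole count.
size_t count_lt_unrolled(const int *arr, size_t size, int key)
{
    size_t c0 = 0, c1 = 0, c2 = 0, c3 = 0;
    for (size_t i = 0; i < size; i += 4) {
        c0 += arr[i + 0] < key;   // four independent dep chains
        c1 += arr[i + 1] < key;
        c2 += arr[i + 2] < key;
        c3 += arr[i + 3] < key;
    }
    return (c0 + c1) + (c2 + c3);
}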
Obviously you'll need to use a pointer-increment so cmp [rdi], ecx can micro-fuse with a non-indexed addressing mode; otherwise the cmp/setcc/add is 4 uops total, and that's the pipeline width on Intel CPUs.
There's no partial-register stall from the caller reading EAX after writing AL, even on P6-family, because we xor-zeroed it first. Sandybridge won't rename it separately from RAX because add al,dl is a read-modify-write, and IvB and later never rename AL separately from RAX (only AH/BH/CH/DH). CPUs other than P6 / SnB-family don't do partial-register renaming at all, only partial flags.
The same applies for the version that reads EDX inside the loop. But an interrupt-handler saving/restoring RDX with push/pop would destroy its xor-zeroed status, leading to partial-register stalls every iteration on P6-family. This is catastrophically bad, so that's one reason compilers never hoist the xor-zeroing. They usually don't know if a loop will be long-running or not, and won't take the risk. By hand, you'd probably want to unroll and xor-zero once per unrolled loop body, rather than once per cmp/setcc.
SSE2 and MMX are both baseline on x86-64. Since you're not gaining anything (on SnB-family) from folding the load into the cmp, you might as well use a scalar movd load into an XMM register. MMX has the advantage of smaller code-size, but requires EMMS when you're done. It also allows unaligned memory operands, so it's potentially interesting for simpler auto-vectorization.
Until AVX512, we only have comparison for greater-than available, so it would take an extra movdqa xmm,xmm instruction to do key > arr[i] without destroying key, instead of arr[i] > key. (This is what gcc and clang do when auto-vectorizing.)
AVX would be nice, for vpcmpgtd xmm0, xmm1, [rdi] to do key > arr[i], like gcc and clang use with AVX. But that's a 128-bit load, and we want to keep it simple and scalar.
We can decrement key and use (arr[i] < key) = (arr[i] <= key-1) = !(arr[i] > key-1). We can count the elements that are greater than key-1, and subtract that from the size. So we can make do with just SSE2 without costing extra instructions.
If key was already the most-negative number (so key-1 would wrap), then no array elements can be less than it. This does introduce a branch before the loop if that case is actually possible.
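In scalar C, that transformation is (a sketch; the INT_MIN early-out corresponds to the jo branch in the asm below, and the function name is illustrative):

#include <limits.h>
#include <stddef.h>

// Sketch of the key-1 trick: count(arr[i] < key) == size - count(arr[i] > key-1).
size_t count_lt_via_gt(const int *arr, size_t size, int key)
{
    if (key == INT_MIN)        // key-1 would wrap; nothing can be < INT_MIN
        return 0;
    int km1 = key - 1;
    size_t gt = 0;             // elements > key-1, i.e. >= key
    for (size_t i = 0; i < size; i++)
        gt += arr[i] > km1;
    return size - gt;
}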
; signed version of the function in your question
; using the low element of XMM vectors
count_signed_lt: ; (int *arr, size_t size, int key)
; actually only works for 0 < size < 2^32 (like the do{}while in the question, it doesn't handle size == 0)
dec edx ; key-1
jo .key_eq_int_min
movd xmm2, edx ; not broadcast, we only use the low element
movd xmm1, esi ; counter = size, decrement toward zero on elements >= key
;; pxor xmm1, xmm1 ; counter
;; mov eax, esi ; save original size for a later SUB
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
.loop:
movd xmm0, [rdi + 4 * rsi]
pcmpgtd xmm0, xmm2 ; xmm0 = arr[i] gt key-1 = arr[i] >= key = not less-than
paddd xmm1, xmm0 ; counter += 0 or -1
;; psubd xmm1, xmm0 ; -0 or -(-1) to count upward
inc rsi
jnz .loop
movd eax, xmm1 ; size - count(elements > key-1)
ret
.key_eq_int_min:
xor eax, eax ; no array elements are less than the most-negative number
ret
This should be the same speed as your loop on Intel SnB-family CPUs, plus a tiny bit of extra overhead outside. It's 4 fused-domain uops, so it can issue at 1 per clock. A movd load uses a regular load port, and there are at least 2 vector ALU ports that can run PCMPGTD and PADDD.
Oh, but on IvB/SnB the macro-fused inc/jnz requires port 5, while PCMPGTD / PADDD both only run on p1/p5, so port 5 throughput will be a bottleneck. On HSW and later the branch runs on port 6, so we're fine for back-end throughput.
It's worse on AMD CPUs where a memory-operand cmp can use an indexed addressing mode without a penalty. (And on Intel Silvermont, and Core 2 / Nehalem, where memory-source cmp can be a single uop with an indexed addressing mode.)
And on Bulldozer-family, a pair of integer cores share a SIMD unit, so sticking to integer registers could be an even bigger advantage. That's also why int<->XMM movd/movq has higher latency, again hurting this version.
Clang for PowerPC64 (included in the Godbolt link) shows us a neat trick: zero- or sign-extend to 64-bit, subtract, and then grab the MSB of the result as a 0/1 integer that you add to counter. PowerPC has excellent bitfield instructions, including rldicl. In this case, it's being used to rotate left by 1, and then zero all bits above that, i.e. extracting the MSB to the bottom of another register. (Note that PowerPC documentation numbers bits with MSB=0, LSB=63 or 31.)
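The same trick written as C (a sketch; this is essentially what the PowerPC output below, and the x86 version further down, implement):

#include <stdint.h>
#include <stddef.h>

// Sketch of the extend/subtract/grab-MSB trick: for 32-bit a and b,
// (int64_t)a - (int64_t)b can't overflow, so its sign bit is exactly (a < b).
size_t count_lt_msb(const int *arr, size_t size, int key)
{
    size_t count = 0;
    for (size_t i = 0; i < size; i++)
        count += (uint64_t)((int64_t)arr[i] - key) >> 63;
    return count;
}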
If you don't disable auto-vectorization, it uses Altivec with a vcmpgtsw / vsubuwm loop, which I assume does what you'd expect from the names.
# PowerPC64 clang 9-trunk -O3 -fno-tree-vectorize -fno-unroll-loops -mcpu=power9
# signed int version
# I've added "r" to register names, leaving immediates alone, because clang doesn't have `-mregnames`
... setup
.LBB0_2: # do {
lwzu r5, 4(r6) # zero-extending load and update the address register with the effective-address. i.e. pre-increment
extsw r5, r5 # sign-extend word (to doubleword)
sub r5, r5, r4 # 64-bit subtract
rldicl r5, r5, 1, 63 # rotate-left doubleword immediate then clear left
add r3, r3, r5 # retval += MSB of (int64_t)arr[i] - key
bdnz .LBB0_2 # } while(--loop_count);
I think clang could have avoided the extsw inside the loop if it had used an arithmetic (sign-extending) load. The only lwa that updates the address register (saving an increment) seems to be the indexed form lwaux RT, RA, RB, but if clang put 4 in another register it could use it. (There doesn't seem to be a lwau instruction.) Maybe lwaux is slow, or maybe it's a missed optimization. I used -mcpu=power9, so even though that instruction is POWER-only, it should be available.
This trick could sort of help for x86, at least for a rolled-up loop. It takes 4 uops this way per compare, not counting loop overhead. Despite x86's pretty bad bitfield extract capabilities, all we actually need is a logical right-shift to isolate the MSB.
count_signed_lt: ; (int *arr, size_t size, int key)
xor eax, eax
movsxd rdx, edx
lea rdi, [rdi + rsi*4]
neg rsi ; we loop from -length to zero
.loop:
movsxd rcx, dword [rdi + 4 * rsi] ; 1 uop, pure load
sub rcx, rdx ; (int64_t)arr[i] - key
shr rcx, 63 ; extract MSB
add eax, ecx ; count += MSB of (int64_t)arr[i] - key
inc rsi
jnz .loop
ret
This doesn't have any false dependencies, but neither does the 4-uop xor-zero / cmp / setl / add. The only advantage here is that this is 4 uops even with an indexed addressing mode. Some AMD CPUs may run MOVSXD through an ALU as well as a load port, but Ryzen has the same latency for it as for regular loads.
If you have fewer than 64 iterations, you could do something like this if only throughput matters, not latency. (But you can probably still do better with setl.)
.loop:
movsxd rcx, dword [rdi + 4 * rsi] ; 1 uop, pure load
sub rcx, rdx ; (int64_t)arr[i] - key
shld rax, rcx, 1 ; 3 cycle latency
inc rsi / jnz .loop
popcnt rax, rax ; turn the bitmap of compare results into an integer
But the 3-cycle latency of shld makes this a showstopper for most uses, even though it's only a single uop on SnB-family. The rax->rax dependency is loop-carried.
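For reference, the bitmap-of-results idea looks like this in C (a sketch for size <= 64; __builtin_popcountll is a GCC/Clang builtin, and the function name is illustrative):

#include <stdint.h>
#include <stddef.h>

// Sketch: pack one compare result per bit, then popcount at the end.
size_t count_lt_bitmap(const int *arr, size_t size, int key)
{
    uint64_t bits = 0;
    for (size_t i = 0; i < size; i++)
        bits = (bits << 1) | (uint64_t)(arr[i] < key);   // like shld rax, rcx, 1
    return (size_t)__builtin_popcountll(bits);
}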