ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and micro-fusion of indexed addressing modes in some cases.
(But BDW/SKL still uses 2 uops for the adc al, imm8
short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.)
But adc
with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a check for this performance quirk in his uarch-bench: https://github.com/travisdowns/uarch-bench. Sample output from CI on a Haswell server showing a difference between adc reg,0
and adc reg,1
or adc reg,zeroed-reg
.
(But only for 32 or 64-bit operand-size, not adc bl,0
. So use 32-bit when using adc on a setcc result to combine 2 conditions into one branch.)
Same for SBB. As far as I've seen, there's never any difference between ADC and SBB performance on any CPU, for the equivalent encoding with the same immediate value.
When was this optimization for imm=0
introduced?
I tested on Core 21, and found that adc eax,0
latency is 2 cycles, same as adc eax,3
. And also the cycle count is identical for a few variations of throughput tests with 0
vs. 3
, so first-gen Core 2 (Conroe/Merom) doesn't do this optimization.
The easiest way to answer this is probably to use my test program below on a Sandybridge system, and see if adc eax,0
is faster than adc eax,1
. But answers based on reliable documentation would be fine, too.
Footnote 1: I used this test program on my Core 2 E6600 (Conroe / Merom), running Linux.
;; NASM / YASM
;; assemble / link this into a 32 or 64-bit static executable.
global _start
_start:
mov ebp, 100000000
align 32
.loop:
xor ebx,ebx ; avoid partial-flag stall but don't break the eax dependency
%rep 5
adc eax, 0 ; should decode in a 2+1+1+1 pattern
add eax, 0
add eax, 0
add eax, 0
%endrep
dec ebp ; I could have just used SUB here to avoid a partial-flag stall
jg .loop
%ifidn __OUTPUT_FORMAT__, elf32
;; 32-bit sys_exit would work in 64-bit executables on most systems, but not all. Some, notably Window's subsystem for Linux, disable IA32 compat
mov eax,1
xor ebx,ebx
int 0x80 ; sys_exit(0) 32-bit ABI
%else
xor edi,edi
mov eax,231 ; __NR_exit_group from /usr/include/asm/unistd_64.h
syscall ; sys_exit_group(0)
%endif
Linux perf
doesn't work very well on old CPUs like Core 2 (it doesn't know how to access all the events like uops), but it does know how to read the HW counters for cycles and instructions. That's sufficient.
I built and profiled this with
yasm -felf64 -gdwarf2 testloop.asm
ld -o testloop-adc+3xadd-eax,imm=0 testloop.o
# optional: taskset pins it to core 1 to avoid CPU migrations
taskset -c 1 perf stat -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0
Performance counter stats for './testloop-adc+3xadd-eax,imm=0':
1061.697759 task-clock (msec) # 0.992 CPUs utilized
100 context-switches # 0.094 K/sec
2,545,252,377 cycles # 2.397 GHz
2,301,845,298 instructions # 0.90 insns per cycle
1.069743469 seconds time elapsed
0.9 IPC is the interesting number here.
This is about what we'd expect from static analysis with a 2 uop / 2c latency adc
: (5*(1+3) + 3) = 23
instructions in the loop, 5*(2+3) = 25
cycles of latency = cycles per loop iteration. 23/25 = 0.92.
It's 1.15 on Skylake. (5*(1+3) + 3) / (5*(1+3)) = 1.15
, i.e. the extra .15 is from the xor-zero and dec/jg while the adc/add chain runs at exactly 1 uop per clock, bottlenecked on latency. We'd expect this 1.15 overall IPC on any other uarch with single-cycle latency adc
, too, because the front-end isn't a bottleneck. (In-order Atom and P5 Pentium would be slightly lower, but xor and dec can pair with adc or add on P5.)
On SKL, uops_issued.any
= instructions
= 2.303G, confirming that adc
is single uop (which it always is on SKL, regardless of what value the immediate has). By chance, jg
is the first instruction in a new cache line so it doesn't macro-fuse with dec
on SKL. With dec rbp
or sub ebp,1
instead, uops_issued.any
is the expected 2.2G.
This is extremely repeatable: perf stat -r5
(to run it 5 times and show average + variance), and multiple runs of that, showed the cycle count was repeatable to 1 part in 1000. 1c vs. 2c latency in adc
would make a much bigger difference than that.
Rebuilding the executable with an immediate other than 0
doesn't change the timing at all on Core 2, another strong sign that there's no special case. That's definitely worth testing.
I was initially looking at throughput (with xor eax,eax
before each loop iteration, letting OoO exec overlap iterations), but it was hard to rule out front-end effects. I think I finally did avoid a front-end bottleneck by adding single-uop add
instructions. The throughput-test version of the inner loop looks like this:
xor eax,eax ; break the eax and CF dependency
%rep 5
adc eax, 0 ; should decode in a 2+1+1+1 pattern
add ebx, 0
add ecx, 0
add edx, 0
%endrep
That's why the latency-test version looks kinda weird. But anyway, remember that Core2 doesn't have a decoded-uop cache, and its loop buffer is in the pre-decode stage (after finding instruction boundaries). Only 1 of the 4 decoders can decode multi-uop instructions, so adc
being multi-uop bottlenecks on the front-end. I guess I could have just let that happen, with times 5 adc eax, 0
, since it's unlikely that some later stage of the pipeline would be able to throw out that uop without executing it.
Nehalem's loop buffer recycles decoded uops, and would avoid that decode bottleneck for back-to-back multi-uop instructions.
According to my microbenchmarks, the results of which can be found on uops.info, this optimization was introduced with Sandy Bridge (https://www.uops.info/html-tp/SNB/ADC_R64_0-Measurements.html). Westmere does not do this optimization (https://uops.info/html-tp/WSM/ADC_R64_0-Measurements.html). The data was obtained using a Core i7-2600, and a Core i5-650.
Furthermore, the data on uops.info shows that the optimization is not performed if an 8-bit register is used (Sandy Bridge, Ivy Bridge, Haswell).
It's not present on Nehalem, but is on IvyBridge. So it was new either in Sandybridge or IvB.
My guess is Sandybridge for this, because that was a major redesign of the decoders (producing up to 4 total uops, rather than patterns like 4+1+1+1 that were possible in Core2 / Nehalem), and hanging on to instructions that can macro-fuse (like add
or sub
) if they're the last in a group in case the next instruction is a jcc
.
Significantly for this, I think SnB decoders also look at the imm8 in immediate-count shifts to check if it's zero, instead of only doing that in the execution units2.
Hard data so far:
adc r,imm
and adc r,r
are always 1 uop, except for the AL/AX/EAX/RAX imm short form1 on Broadwell/Skylake.adc reg,0
is 1 uop, adc reg,1
is 2. For 32 and 64-bit operand-size, not 8-bit.
adc
is slower than add
regardless of the imm.Footnote 1: On Skylake, the al/ax/eax/rax, imm8/16/32/32 short-form encodings with no ModR/M byte still decode to 2 uops, even when the immediate is zero. For example, adc eax, strict dword 0
(15 00 00 00 00
) is twice as slow as 83 d0 00
. Both uops are on the critical path for latency.
Looks like Intel forgot to update the decoding for the other immediate forms of adc
and sbb
! (This all applies equally to both ADC and SBB.)
Assemblers will use the short-form by default for immediates that don't fit in an imm8, so for example adc rax, 12345
assembles to 48 15 39 30 00 00
instead of the one-byte larger single-uop form that is the only option for registers other than the accumulator.
A loop that bottlenecks on adc rcx, 12345
instead of RAX latency runs twice as fast. But adc rax, 123
is unaffected, because it uses the adc r/m64, imm8
encoding which is single uop.
Footnote 2: See INC instruction vs ADD 1: Does it matter? for quotes from Intel's optimization manual about Core2 stalling the front-end if a later instruction reads flags from a shl r/m32, imm8
, in case the imm8 was 0. (As opposed to the implicit-1 opcode, which the decoder knows always writes flags.)
But SnB-family doesn't do that; the decoder apparently checks the imm8 to see whether the instruction writes flags unconditionally or whether it leaves them untouched. So checking an imm8 is something that SnB decoders already do, and could usefully do for adc
to omit the uop that adds that input, leaving only adding CF to the destination.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With