Consider the following x86-64 assembly:
inner:
    ...
    ret

outer:
.top:
    call inner
    dec  rdi
    jnz  .top
    ret
The function outer simply makes repeated calls to the function inner (whose body isn't shown - it may be empty). Does the series of call instructions in outer, and the corresponding ret instructions inside inner, form a dependent chain in practice (for the purposes of estimating performance)?
There is more than one way this chain could be formed. For example, does the ret depend on the latency of the preceding call instruction, and does the subsequent call instruction then depend on the ret, forming a call -> ret -> call chain? Or perhaps the ret is independent but the call is not, forming a call -> call chain? If there is a chain, is it through memory, a register, the stack engine, the return address predictor1, or what?
Motivation: This question originated from a series of comments on another question, mostly this comment and earlier ones.
1 The terminology might be somewhat unclear here: the stack engine is normally understood to handle transforming rsp-modifying instructions into a single access with an appropriate offset, so that push rax; push rbx might be transformed into something like mov [t0], rax; mov [t0 - 8], rbx, where t0 is some temporary register that captured the value of rsp at some point. It is also understood to handle a similar transformation for call and ret instructions, which both modify the stack in a way similar to push and pop, as well as including a direct or indirect (respectively) jump. The CPU also includes a mechanism to predict that return indirect jump, which some lump under "stack engine" - but here I'm separating that out into "return address predictor".
No, branch-prediction + speculative execution break the store/reload dependency.
RIP is (speculatively) known by the front-end, from the return-address predictor. The next call instruction can thus push a return address without waiting for the ret to execute (and actually load and verify the correctness of the predicted return address against the data from the stack).
Speculative stores can enter the store buffer and be store-forwarded.
There is of course a dependency chain, but it's not loop-carried. Out-of-order execution hides it by keeping many iterations in flight.
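To spell out the non-loop-carried chain, here's one iteration of the test loop from the proof below, annotated with the data flow through the return-address slot (annotation only; same code as below):

    call delay_retaddr        ; stores the ret addr to [rsp]: chain starts here
        add qword [rsp], 0    ; load [rsp] (store-forwarded) + add + store
        add qword [rsp], 0    ; load [rsp] (store-forwarded) + add + store
        ret                   ; reload [rsp]; the jump itself was already predicted
    call delay_retaddr        ; fresh store to [rsp]: starts a NEW chain,
                              ; independent of the previous call's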
Proof: call's store breaks what would otherwise be a loop-carried memory dependency chain.
align 64
global _start
_start:
    mov   ebp, 250000000   ; I had been unrolling by 4, should have changed this to 5000... before measuring, but forgot.

align 32
.mainloop:
    call  delay_retaddr
    call  delay_retaddr

    dec   ebp
    jg    .mainloop

    xor   edi, edi
    mov   eax, 231         ; __NR_exit_group from /usr/include/asm/unistd_64.h
    syscall                ; sys_exit_group(0)

;; Placing this function *before* _start, or increasing the alignment,
;; makes it somewhat slower!
align 32
delay_retaddr:
    add qword [rsp], 0
    add qword [rsp], 0     ; create latency for the ret addr
    ret
Assemble and link with yasm -felf64 -Worphan-labels -gdwarf2 foo.asm && ld -o foo foo.o, producing a static ELF binary.
Profiled (on an i7-6700k) with ocperf.py, I get 0.99 instructions per core clock cycle:
$ taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dsb2mite_switches.penalty_cycles -r2 ./foo

 Performance counter stats for './foo' (2 runs):

        645.770390      task-clock (msec)                 #    1.000 CPUs utilized          ( +-  0.05% )
                 1      context-switches                  #    0.002 K/sec                  ( +-100.00% )
                 0      cpu-migrations                    #    0.000 K/sec
                 2      page-faults                       #    0.004 K/sec                  ( +- 20.00% )
     2,517,412,984      cycles                            #    3.898 GHz                    ( +-  0.09% )
     1,250,159,413      branches                          # 1935.919 M/sec                  ( +-  0.00% )
     2,500,838,090      instructions                      #    0.99  insn per cycle         ( +-  0.00% )
     4,010,093,750      uops_issued_any                   # 6209.783 M/sec                  ( +-  0.03% )
     7,010,150,784      uops_executed_thread              # 10855.485 M/sec                 ( +-  0.02% )
            62,838      dsb2mite_switches_penalty_cycles  #    0.097 M/sec                  ( +- 30.92% )

       0.645899414 seconds time elapsed                                          ( +-  0.05% )
With the called function before _start, and alignment values of 128, IPC can go down from 0.99 to 0.84, which is super-weird. Counts for dsb2mite switches are still low-ish, so it's mostly still running from the uop cache, not the legacy decoders. (This Skylake CPU has the microcode update that disables the loop buffer, in case that would be relevant with all this jumping.)
To sustain good throughput, the CPU has to keep many iterations of the inner loop in flight because we've significantly lengthened the independent dep chains that need to overlap.
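Back-of-the-envelope, assuming ~5 cycles of store-forwarding latency on Skylake (an assumed round number for this estimate, not something measured above):

    per-call chain:   call store -> add -> add -> ret reload
                      ~ 3 store-forwards * 5c + 2 ALU ops * 1c ~ 17 cycles
    loop throughput:  10 instructions / iteration at ~1 IPC ~ 10 cycles / iteration

A new ~17-cycle chain starts every ~5 cycles (two calls per iteration), so three or four of these chains have to be in flight at once.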
Changing the add [rsp], 0 instructions to use [rsp+16] creates a loop-carried dependency chain on a different location, one which isn't being stored to by call. So the loop bottlenecks on that store-forwarding latency and runs at ~half speed.
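For reference, the modified function (only delay_retaddr changes; the rest of the test is as above). [rsp+16] is up in the caller's stack frame, a location call never stores to, so each add forwards from the previous add's store, across calls and across loop iterations:

    align 32
    delay_retaddr:
        add qword [rsp+16], 0    ; forwards from the previous add's store:
        add qword [rsp+16], 0    ; one long loop-carried chain through memory
        ret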
# With add qword [rsp+16], 0

 Performance counter stats for './foo' (2 runs):

       1212.339007      task-clock (msec)                 #    1.000 CPUs utilized          ( +-  0.04% )
                 2      context-switches                  #    0.002 K/sec                  ( +- 60.00% )
                 0      cpu-migrations                    #    0.000 K/sec
                 2      page-faults                       #    0.002 K/sec
     4,727,361,809      cycles                            #    3.899 GHz                    ( +-  0.02% )
     1,250,292,058      branches                          # 1031.306 M/sec                  ( +-  0.00% )
     2,501,537,152      instructions                      #    0.53  insn per cycle         ( +-  0.00% )
     4,026,138,227      uops_issued_any                   # 3320.967 M/sec                  ( +-  0.02% )
     7,026,457,222      uops_executed_thread              # 5795.786 M/sec                  ( +-  0.01% )
           230,287      dsb2mite_switches_penalty_cycles  #    0.190 M/sec                  ( +- 68.23% )

       1.212612110 seconds time elapsed                                          ( +-  0.04% )
Note that I'm still using an RSP-relative address, so there's still a stack-sync uop. I could have kept both cases the same and avoided it in both by using an address relative to a different register (e.g. rbp) to address the location where call/ret store/reload the return address.
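An untested sketch of that variant (using rbx rather than rbp, since ebp is already the loop counter here; illustration only, I didn't measure it):

    _start:
        lea  rbx, [rsp-8]        ; the slot where call will store the return address
        mov  ebp, 250000000
    align 32
    .mainloop:
        ...                      ; loop body as above

    align 32
    delay_retaddr:
        add  qword [rbx], 0      ; same location as [rsp], but no stack-sync uop
        add  qword [rbx], 0
        ret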
I don't think the variable latency of store-forwarding (worse in simple back-to-back reload-right-away cases; see Adding a redundant assignment speeds up code when compiled without optimization) is sufficient to explain the difference. Breaking the dependency gives a factor-of-2 speedup: 0.99 IPC vs. 0.53 IPC, with the same instructions and only a different addressing mode.
The instructions are 1 byte longer with the disp8 in the addressing mode, and there was front-end weirdness with alignment in the faster version, but moving things around doesn't seem to change anything with the [rsp+16] version.
Using a version that creates a store-forwarding stall (with add dword [rsp], 0) makes the dep chain too long for OoO exec to hide easily: the dword store only covers the low half of the qword return-address slot, so ret's qword reload can't be store-forwarded and has to wait for the store to commit to the cache. I didn't play around with this a huge amount.