I linked some assembly with some c to test the cost of a function call, with the following assembly and c source (using fasm and gcc respectively)
assembly:
format ELF
public no_call as "_no_call"
public normal_call as "_normal_call"
section '.text' executable
iter equ 100000000
no_call:
mov ecx, iter
@@:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne @b
ret
normal_function:
ret
normal_call:
mov ecx, iter
@@:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne @b
ret
c source:
#include <stdio.h>
#include <time.h>
extern int no_call();
extern int normal_call();
int main()
{
clock_t ct1, ct2;
ct1 = clock();
no_call();
ct2 = clock();
printf("\n\n%d\n", ct2 - ct1);
ct1 = clock();
normal_call();
ct2 = clock();
printf("%d\n", ct2 - ct1);
return 0;
}
The results I got were suprising. First of all, the speed depended on the order in which I linked mattered. If I linked as gcc intern.o extern.o
, a typical output is
162
181
But linking in the opposite order gcc extern.o intern.o
, I got an output more like:
162
130
That they are different was very suprising but is not the question I am asking. (relevant question here)
The question I am asking is how is it that in the second run the loop with the function call was faster than the loop without one, how was the cost of calling a function apparently negative.
Edit: Just to make mention of some of the things tried in the comments:
162.168
and 131.578
seconds for my run timesAdditionally, after modifying the assembly code to align on bytes, I tested giving the set of functions an additional offset and came to some more strange conclusions. Here is the updated code:
format ELF
public no_call as "_no_call"
public normal_call as "_normal_call"
section '.text' executable
iter equ 100000000
offset equ 23 ; this is the number I am changing
times offset nop
times 16 nop
no_call:
mov ecx, iter
no_call.loop_start:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne no_call.loop_start
ret
times 55 nop
normal_function:
ret
times 58 nop
normal_call:
mov ecx, iter
normal_call.loop_start:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne normal_call.loop_start
ret
I had to manually (and non portably) force the 64 byte alignment since FASM does not support more than 4 byte alignment for the executable section, at least on my machine. Offsetting the program by offset
bytes, here is what I found.
if (20 <= offset mod 128 <= 31) then we get an output of (approximately):
162
131
else
162 (+/- 10)
162 (+/- 10)
Not sure at all what to make of it, but that's what I've discovered so far
Edit 2:
Another thing I noticed is that if you remove push ecx
and pop ecx
from both functions, the output becomes
30
125
which indicates that that is the most expensive part of it. The stack alignment is the same both times, so that is not the reason for the discrepency. My best guess is that somehow the hardware is optimized to expect a call after a push or something similar, but I do not know of anything like that
while loops are not faster than for loops, however if you (re)use the loop counter a lot in the code, then you may find a marginal speed up by iterating from N to 0 rather than the other way round. Agreed at profiling and using results of profiling.
For loop (forward and reverse) The traditional for loop is the fastest, so you should always use that right? Not so fast - performance is not the only thing that matters. Code Readability is usually more important, so default to the style that fits your application.
Update: Skylake store/reload latency is as low as 3c, but only if the timing is right. Consecutive loads involved in a store-forwarding dependency chain that are naturally spaced out by 3 or more cycles will experience the faster latency (e.g. with 4 imul eax,eax
in the loop, mov [rdi], eax
/ mov eax, [rdi]
only takes the cycle count up from 12 to 15 cycles per iteration.) but when the loads are allow to execute more densely than that, some type of contention is suffered and you get about 4.5 cycles per iteration. The non-integer average throughput is also a big clue that there's something unusual.
I saw the same effect for 32B vectors (best case 6.0c, back-to-back 6.2 to 6.9c), but 128b vectors were always around 5.0c. See details on Agner Fog's forum.
Update2: Adding a redundant assignment speeds up code when compiled without optimization and a 2013 blog post indicate that this effect is present on all Sandybridge-family CPUs.
The back-to-back (worst case) store-forwarding latency on Skylake is 1 cycle better than on previous uarches, but the variability when the load can't execute right away is similar.
With the right (mis-)alignment, the extra call
in the loop can actually help Skylake observe lower store-forwarding latency from push to pop. I was able to reproduce this with perf counters (Linux perf stat -r4
), using YASM. (I've heard it's less convenient to use perf counters on Windows, and I don't have a Windows dev machine anyway. Fortunately the OS isn't really relevant to the answer; anyone should be able to reproduce my perf-counter results on Windows with VTune or something.)
I saw the faster times at offset = 0..10, 37, 63-74, 101, and 127 following an align 128
at the spot specified in the question. L1I cache lines are 64B, and the uop-cache is cares about 32B boundaries. It looks alignment relative to a 64B boundary is all that matters.
The no-call loop is a steady 5 cycles always, but the call
loop can get down to 4c per iteration from its usual almost-exactly-5 cycles. I saw slower-than-usual performance at offset=38 (5.68 +- 8.3% cycles per iteration). There are small glitches at other points, like 5.17c +- 3.3%, according to perf stat -r4
(which does 4 runs and averaging).
It seems to be an interaction between the front-end not queueing up so many uops ahead, causing the back end to have lower latency for store-forwarding from push to pop.
IDK if reusing the same address repeatedly for store-forwarding makes it slower (with multiple store-address uops already executed ahead of the corresponding store-data uops), or what.
Test code: bash
shell loop to build & profile the asm with every different offset:
(set -x; for off in {0..127};do
asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=$off &&
ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,dsb2mite_switches.penalty_cycles -r4 ./call-tight-loop;
done ) |& tee -a call-tight-loop.call.offset-log
(set -x)
in a subshell is a handy way to log commands along with their output when redirecting to a log file.
asm-link
is a script that runs yasm -felf32 -Worphan-labels -gdwarf2 call-tight-loop.asm "$@" && ld -melf_i386 -o call-tight-loop call-tight-loop.o
, then runs objdumps -drwC -Mintel
on the result.
NASM / YASM Linux test program (assembles into a complete static binary that runs the loop and then exits, so you can profile the whole program.) Direct port of the OP's FASM source, with no optimizations to the asm.
CPU p6 ; YASM directive. For NASM, %use smartalign.
section .text
iter equ 100000000
%ifndef OFFSET
%define OFFSET 0
%endif
align 128
;;offset equ 23 ; this is the number I am changing
times OFFSET nop
times 16 nop
no_call:
mov ecx, iter
.loop:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret
times 55 nop
normal_function:
ret
times 58 nop
normal_call:
mov ecx, iter
.loop:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret
%ifndef FUNC
%define FUNC no_call
%endif
align 64
global _start
_start:
call FUNC
mov eax,1 ; __NR_exit from /usr/include/asm/unistd_32.h
xor ebx,ebx
int 0x80 ; sys_exit(0), 32-bit ABI
Sample output from a fast call
run:
+ asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=3
...
080480d8 <normal_function>:
80480d8: c3 ret
...
08048113 <normal_call>:
8048113: b9 00 e1 f5 05 mov ecx,0x5f5e100
08048118 <normal_call.loop>:
8048118: 51 push ecx
8048119: e8 ba ff ff ff call 80480d8 <normal_function>
804811e: 59 pop ecx
804811f: 49 dec ecx
8048120: 83 f9 00 cmp ecx,0x0
8048123: 75 f3 jne 8048118 <normal_call.loop>
8048125: c3 ret
...
Performance counter stats for './call-tight-loop' (4 runs):
100.646932 task-clock (msec) # 0.998 CPUs utilized ( +- 0.97% )
0 context-switches # 0.002 K/sec ( +-100.00% )
0 cpu-migrations # 0.000 K/sec
1 page-faults:u # 0.010 K/sec
414,143,323 cycles # 4.115 GHz ( +- 0.56% )
700,193,469 instructions # 1.69 insn per cycle ( +- 0.00% )
700,293,232 uops_issued_any # 6957.919 M/sec ( +- 0.00% )
1,000,299,201 uops_executed_thread # 9938.695 M/sec ( +- 0.00% )
83,212,779 idq_mite_uops # 826.779 M/sec ( +- 17.02% )
5,792 dsb2mite_switches_penalty_cycles # 0.058 M/sec ( +- 33.07% )
0.100805233 seconds time elapsed ( +- 0.96% )
You push/pop your loop counter, so everything except the call
and ret
instructions (and the cmp
/jcc
) are part of the critical path loop-carried dependency chain involving the loop counter.
You'd expect that pop
would have to wait for updates to the stack pointer by call
/ret
, but the stack engine handles those updates with zero latency. (Intel since Pentium-M, AMD since K10, according to Agner Fog's microarch pdf, so I'm assuming your CPU has one, even though you didn't say anything about what CPU microarchitecture you ran your tests on.)
The extra call
/ret
still need to execute, but out-of-order execution can keep the critical path instructions running at their max throughput. Since this includes the latency of a store->load forwarding from push/pop + 1 cycle for dec
, this is not high throughput on any CPU, and it's a surprise that the front-end can ever be a bottleneck with any alignment.
push
->pop
latency is 5 cycles on Skylake, according to Agner Fog, so on that uarch your loop can only run at best one iteration per 6 cycles.
This is plenty of time for out-of-order-execution to run the call
and ret
instructions. Agner lists a max throughput for call
of one per 3 cycles, and ret
at one per 1 cycle. Or on AMD Bulldozer, 2 and 2. His tables don't list anything about the throughput of a call
/ret
pair, so IDK whether those can overlap or not. On AMD Bulldozer, store/reload latency with mov
is 8 cycles. I assume it's about the same with push/pop.
It seems that different alignments for the top of the loop (i.e. no_call.loop_start:
) are causing front-end bottlenecks. The call
version has 3 branches per iteration: the call, the ret, and the loop-branch. Note that the ret
's branch target is the instruction right after the call
. Each of these potentially disrupts the front-end. Since you're seeing an actual slow-down in practice, we must be seeing more than 1 cycle delay per branch. Or for the no_call version, a single fetch/decode bubble worse than about 6 cycles, leading to an actual wasted cycle in issuing uops into the out-of-order part of the core. That's weird.
It's too complicated to guess about what the actual microarchitectural details are for every possible uarch, so let us know what CPU you tested on.
I will mention though that push
/pop
inside a loop on Skylake stops it from issuing from the Loop Stream Detector, and has to be re-fetched from the uop cache every time. Intel's optimization manual says that for Sandybridge, a mismatched push/pop inside a loop stops it from using the LSD. That implies that it can use the LSD for loops with balanced push/pop. In my testing, that's not the case on Skylake (using the lsd.uops
performance counter), but I haven't seen any mention of whether that was a change, or whether SnB was actually like that, too.
Also, unconditional branches always end a uop-cache line. It's possible that with normal_function:
in the same naturally-aligned 32B chunk of machine code as the call
and jne
, maybe the block of code doesn't fit in the uop cache. (Only 3 uop-cache lines can cache decoded uops for a single 32B chunk of x86 code). But that wouldn't explain the possibility of problems for the no_call loop, so you're probably not running on an Intel SnB-family microarchitecture.
(Update, yes, the loop sometimes runs mostly from legacy decode (idq.mite_uops
), but usually not exclusively. dsb2mite_switches.penalty_cycles
is usually ~8k, and probably only happens on timer interrupts. The runs where the call
loop runs faster seem to be correlated with lower idq.mite_uops
, but it's still 34M +- 63% for the offset=37 case where the 100M iterations took 401M cycles.)
This is really one of those "don't do that" cases: inline tiny functions instead of calling them from inside very tight loops.
You might see different results if you push
/pop
a register other than your loop counter. That would separate the push/pop from the loop counter, so there'd be 2 separate dependency chains. It should speed up both the call and no_call versions, but maybe no equally. It could just make a front-end bottleneck more obvious.
You should see a huge speedup if you push edx
but pop eax
, so the push/pop instructions don't form a loop-carried dependency chain. Then the extra call
/ret
would definitely be a bottleneck.
Side-note: dec ecx
already sets ZF the way you want, so you could have just used dec ecx / jnz
. Also, cmp ecx,0
is less efficient than test ecx,ecx
(larger code size and can't macro-fuse on as many CPUs). Anyway, totally irrelevant to the question about relative performance of your two loops. (Your lack of an ALIGN
directive between functions means that changing the first one would have changed the alignment of the loop branch in the 2nd, but you already explored different alignments.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With