A trivial function I'm compiling with gcc and clang:
#include <stdio.h>

void test() {
    printf("hm");
    printf("hum");
}
$ gcc test.c -fomit-frame-pointer -masm=intel -O3 -S
sub rsp, 8
.cfi_def_cfa_offset 16
mov esi, OFFSET FLAT:.LC0
mov edi, 1
xor eax, eax
call __printf_chk
mov esi, OFFSET FLAT:.LC1
mov edi, 1
xor eax, eax
add rsp, 8
.cfi_def_cfa_offset 8
jmp __printf_chk
And
$ clang test.c -mllvm --x86-asm-syntax=intel -fomit-frame-pointer -O3 -S
# BB#0:
push rax
.Ltmp1:
.cfi_def_cfa_offset 16
mov edi, .L.str
xor eax, eax
call printf
mov edi, .L.str1
xor eax, eax
pop rdx
jmp printf # TAILCALL
The difference I'm interested in is that gcc uses sub rsp, 8 / add rsp, 8 for the function prologue and epilogue, while clang uses push rax / pop rdx.
Why do the compilers use different function prologues? Which variant is better? push and pop certainly encode to shorter instructions, but are they faster or slower than add and sub?
The reason for the stack fiddling in the first place seems to be that the ABI requires rsp to be 16-byte aligned for non-leaf procedures. I haven't been able to find any compiler flag that removes these instructions.
Judging from your answers, it seems like push and pop are better: push rax + pop rdx = 1 + 1 = 2 bytes vs. sub rsp, 8 + add rsp, 8 = 4 + 4 = 8 bytes. So the former pair saves 6 bytes at no expense.
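The byte counts above can be checked against the actual encodings. The following is a minimal sketch in C; the array and function names are made up for illustration, and the opcode bytes are the standard x86-64 encodings written out by hand rather than taken from either compiler's output.

```c
#include <assert.h>
#include <stddef.h>

/* Hand-written x86-64 encodings (illustrative names, not compiler output). */
static const unsigned char PUSH_RAX[]  = { 0x50 };                   /* push rax   */
static const unsigned char POP_RDX[]   = { 0x5A };                   /* pop rdx    */
static const unsigned char SUB_RSP_8[] = { 0x48, 0x83, 0xEC, 0x08 }; /* sub rsp, 8 */
static const unsigned char ADD_RSP_8[] = { 0x48, 0x83, 0xC4, 0x08 }; /* add rsp, 8 */

/* Total code size of clang's push/pop pair. */
size_t push_pop_bytes(void) { return sizeof PUSH_RAX + sizeof POP_RDX; }

/* Total code size of gcc's sub/add pair. */
size_t sub_add_bytes(void) { return sizeof SUB_RSP_8 + sizeof ADD_RSP_8; }

/* Bytes saved by choosing push/pop over sub/add. */
size_t bytes_saved(void) { return sub_add_bytes() - push_pop_bytes(); }

/* Usage: re-checks the arithmetic from the text. */
void demo_sizes(void) {
    assert(push_pop_bytes() == 2);
    assert(sub_add_bytes() == 8);
    assert(bytes_saved() == 6);
}
```

The sub and add forms are longer because each needs a REX.W prefix, an opcode, a ModRM byte and an 8-bit immediate, while push and pop of a legacy register fit in a single opcode byte.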
On Intel, sub / add will trigger the stack engine to insert an extra uop to synchronize %rsp for the out-of-order execution part of the pipeline. (See Agner Fog's microarch doc, specifically pg 91, about the stack engine. AFAIK, it still works the same on Haswell as on Pentium M, as far as when it needs to insert extra uops.)
The push / pop will take fewer fused-domain uops, and so will probably be more efficient even though they use the store/load ports. They sit between call/ret pairs, which the stack engine also handles, so no synchronization uop is needed there.
So, push / pop is at least not going to be slower, and it takes fewer instruction bytes. Better I-cache density is good.
BTW, I think the point of the pair of insns is to keep the stack 16B-aligned after call pushes the 8B return address. This is one case where the ABI ends up requiring semi-useless instructions. More complex functions that need some stack space to spill locals, and then reload them after function calls, will typically sub $something, %rsp to reserve space.
The SystemV (Linux) amd64 ABI guarantees that at function entry, (%rsp + 8), which is where args on the stack will be if there are any, is 16B-aligned (http://x86-64.org/documentation/abi.pdf). You have to arrange for that to be the case for any function you call, or it's your fault if they segfault from using an SSE aligned load, or otherwise crash from making assumptions about how they can use AND to mask an address or something.
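To make the alignment rule concrete, here is a small sketch in C that only models the arithmetic on the stack pointer; it does not touch the real stack. The helper names and the starting address are hypothetical, chosen just to illustrate why one 8-byte adjustment is needed before the nested call.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if addr is a multiple of 16. */
int is_aligned16(uint64_t addr) { return (addr & 15u) == 0; }

/* call pushes an 8-byte return address, so the callee starts at rsp - 8. */
uint64_t rsp_after_call(uint64_t rsp) { return rsp - 8; }

/* Both `sub rsp, 8` and `push rax` lower rsp by 8. */
uint64_t rsp_after_adjust(uint64_t rsp) { return rsp - 8; }

/* ABI entry condition: (rsp + 8) is 16-byte aligned at function entry. */
int entry_ok(uint64_t rsp_at_entry) { return is_aligned16(rsp_at_entry + 8); }

/* Usage: walk through test() calling printf, from an assumed caller rsp. */
void demo_alignment(void) {
    uint64_t caller_rsp = 0x7fffffffe000u;          /* hypothetical, 16B-aligned */
    uint64_t test_entry = rsp_after_call(caller_rsp);
    assert(entry_ok(test_entry));                   /* test() is entered correctly */

    /* Calling printf straight away would break the rule for printf. */
    assert(!entry_ok(rsp_after_call(test_entry)));

    /* With the 8-byte adjustment first, printf sees the expected alignment. */
    assert(entry_ok(rsp_after_call(rsp_after_adjust(test_entry))));
}
```

In this model it makes no difference whether the adjustment comes from sub or from push; only the net change of 8 bytes matters for alignment.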
According to the experiments I did on my machine, push/pop run at the same speed as add/sub. I guess that should be the case for all modern computers.
Anyway, the difference (if any) is really microscopic, so I suggest you safely assume that they are equivalent.