 

Why do C compilers still prefer push over mov for saving registers, even when mov appears faster in llvm-mca?

I noticed that modern C compilers typically use push instructions to save caller-saved registers, rather than explicit mov + sub sequences. However, based on llvm-mca simulations, the mov approach appears more efficient in some cases. Why is this?

On Skylake, push shows a clear dependency chain in llvm-mca. I also tested this on AMD’s Zen 5 (znver5), where push performance approaches mov but remains slower, with the same dependency chain behavior.

I analyzed two ways to save caller-saved registers:

  1. push sequence: 12x push instructions.
  2. mov sequence: 12x mov to stack + 1x sub rsp, 96.
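For reference, the two inputs can be sketched like this (a reconstruction from the llvm-mca listings below; register order and offsets are assumed to match them):

```asm
# test_push.s (sketch): 12 pushes, RSP updated implicitly each time
pushq   %rax
pushq   %rbx
# ... likewise for %rcx, %rdx, and %r8 through %r15 ...

# test_mov.s (sketch): store below RSP (the red zone), then allocate once
movq    %rax, -8(%rsp)
movq    %rbx, -16(%rsp)
# ... same pattern down to movq %r15, -96(%rsp) ...
subq    $96, %rsp
```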

Using llvm-mca (Skylake model), the mov version shows better throughput:

  • Total cycles: 27 (push) vs 15 (mov).
  • uOps: 36 (push) vs 13 (mov).
  • IPC: 0.44 (push) vs 0.87 (mov).

llvm-mca -mcpu=skylake -timeline -iterations=1 test_push.s -o test_push.txt

test_push.txt

Iterations:        1
Instructions:      12
Total Cycles:      27
Total uOps:        36

Dispatch Width:    6
uOps Per Cycle:    1.33
IPC:               0.44
Block RThroughput: 12.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 3      2     1.00           *            pushq %rax
 3      2     1.00           *            pushq %rbx
 3      2     1.00           *            pushq %rcx
 3      2     1.00           *            pushq %rdx
 3      2     1.00           *            pushq %r8
 3      2     1.00           *            pushq %r9
 3      2     1.00           *            pushq %r10
 3      2     1.00           *            pushq %r11
 3      2     1.00           *            pushq %r12
 3      2     1.00           *            pushq %r13
 3      2     1.00           *            pushq %r14
 3      2     1.00           *            pushq %r15


Resources:
[0]   - SKLDivider
[1]   - SKLFPDivider
[2]   - SKLPort0
[3]   - SKLPort1
[4]   - SKLPort2
[5]   - SKLPort3
[6]   - SKLPort4
[7]   - SKLPort5
[8]   - SKLPort6
[9]   - SKLPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -     3.00   3.00   4.00   4.00   12.00  3.00   3.00   4.00   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -      -     1.00    -     1.00   1.00   pushq %rax
 -      -      -      -      -     1.00   1.00   1.00    -      -     pushq %rbx
 -      -      -     1.00   1.00    -     1.00    -      -      -     pushq %rcx
 -      -     1.00    -      -      -     1.00    -      -     1.00   pushq %rdx
 -      -      -      -      -     1.00   1.00    -     1.00    -     pushq %r8
 -      -      -      -     1.00    -     1.00   1.00    -      -     pushq %r9
 -      -      -     1.00    -      -     1.00    -      -     1.00   pushq %r10
 -      -     1.00    -      -     1.00   1.00    -      -      -     pushq %r11
 -      -      -      -     1.00    -     1.00    -     1.00    -     pushq %r12
 -      -      -      -      -      -     1.00   1.00    -     1.00   pushq %r13
 -      -      -     1.00    -     1.00   1.00    -      -      -     pushq %r14
 -      -     1.00    -     1.00    -     1.00    -      -      -     pushq %r15


Timeline view:
                    0123456789       
Index     0123456789          0123456

[0,0]     DeeER.    .    .    .    ..   pushq   %rax
[0,1]     D==eeER   .    .    .    ..   pushq   %rbx
[0,2]     .D===eeER .    .    .    ..   pushq   %rcx
[0,3]     .D=====eeER    .    .    ..   pushq   %rdx
[0,4]     . D======eeER  .    .    ..   pushq   %r8
[0,5]     . D========eeER.    .    ..   pushq   %r9
[0,6]     .  D=========eeER   .    ..   pushq   %r10
[0,7]     .  D===========eeER .    ..   pushq   %r11
[0,8]     .   D============eeER    ..   pushq   %r12
[0,9]     .   D==============eeER  ..   pushq   %r13
[0,10]    .    D===============eeER..   pushq   %r14
[0,11]    .    D=================eeER   pushq   %r15


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     1     1.0    1.0    0.0       pushq  %rax
1.     1     3.0    0.0    0.0       pushq  %rbx
2.     1     4.0    0.0    0.0       pushq  %rcx
3.     1     6.0    0.0    0.0       pushq  %rdx
4.     1     7.0    0.0    0.0       pushq  %r8
5.     1     9.0    0.0    0.0       pushq  %r9
6.     1     10.0   0.0    0.0       pushq  %r10
7.     1     12.0   0.0    0.0       pushq  %r11
8.     1     13.0   0.0    0.0       pushq  %r12
9.     1     15.0   0.0    0.0       pushq  %r13
10.    1     16.0   0.0    0.0       pushq  %r14
11.    1     18.0   0.0    0.0       pushq  %r15
       1     9.5    0.1    0.0       <total>

llvm-mca -mcpu=skylake -timeline -iterations=1 test_mov.s -o test_mov.txt

test_mov.txt

Iterations:        1
Instructions:      13
Total Cycles:      15
Total uOps:        13

Dispatch Width:    6
uOps Per Cycle:    0.87
IPC:               0.87
Block RThroughput: 12.0


Instruction Info:
[1]: #uOps
[2]: Latency
[3]: RThroughput
[4]: MayLoad
[5]: MayStore
[6]: HasSideEffects (U)

[1]    [2]    [3]    [4]    [5]    [6]    Instructions:
 1      1     1.00           *            movq  %rax, -8(%rsp)
 1      1     1.00           *            movq  %rbx, -16(%rsp)
 1      1     1.00           *            movq  %rcx, -24(%rsp)
 1      1     1.00           *            movq  %rdx, -32(%rsp)
 1      1     1.00           *            movq  %r8, -40(%rsp)
 1      1     1.00           *            movq  %r9, -48(%rsp)
 1      1     1.00           *            movq  %r10, -56(%rsp)
 1      1     1.00           *            movq  %r11, -64(%rsp)
 1      1     1.00           *            movq  %r12, -72(%rsp)
 1      1     1.00           *            movq  %r13, -80(%rsp)
 1      1     1.00           *            movq  %r14, -88(%rsp)
 1      1     1.00           *            movq  %r15, -96(%rsp)
 1      1     0.25                        subq  $96, %rsp


Resources:
[0]   - SKLDivider
[1]   - SKLFPDivider
[2]   - SKLPort0
[3]   - SKLPort1
[4]   - SKLPort2
[5]   - SKLPort3
[6]   - SKLPort4
[7]   - SKLPort5
[8]   - SKLPort6
[9]   - SKLPort7


Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    
 -      -      -      -     4.00   4.00   12.00   -     1.00   4.00   

Resource pressure by instruction:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    [8]    [9]    Instructions:
 -      -      -      -      -      -     1.00    -      -     1.00   movq  %rax, -8(%rsp)
 -      -      -      -      -     1.00   1.00    -      -      -     movq  %rbx, -16(%rsp)
 -      -      -      -     1.00    -     1.00    -      -      -     movq  %rcx, -24(%rsp)
 -      -      -      -      -      -     1.00    -      -     1.00   movq  %rdx, -32(%rsp)
 -      -      -      -      -     1.00   1.00    -      -      -     movq  %r8, -40(%rsp)
 -      -      -      -     1.00    -     1.00    -      -      -     movq  %r9, -48(%rsp)
 -      -      -      -      -      -     1.00    -      -     1.00   movq  %r10, -56(%rsp)
 -      -      -      -      -     1.00   1.00    -      -      -     movq  %r11, -64(%rsp)
 -      -      -      -     1.00    -     1.00    -      -      -     movq  %r12, -72(%rsp)
 -      -      -      -      -      -     1.00    -      -     1.00   movq  %r13, -80(%rsp)
 -      -      -      -      -     1.00   1.00    -      -      -     movq  %r14, -88(%rsp)
 -      -      -      -     1.00    -     1.00    -      -      -     movq  %r15, -96(%rsp)
 -      -      -      -      -      -      -      -     1.00    -     subq  $96, %rsp


Timeline view:
                    01234
Index     0123456789     

[0,0]     DeER .    .   .   movq    %rax, -8(%rsp)
[0,1]     D=eER.    .   .   movq    %rbx, -16(%rsp)
[0,2]     D==eER    .   .   movq    %rcx, -24(%rsp)
[0,3]     D===eER   .   .   movq    %rdx, -32(%rsp)
[0,4]     D====eER  .   .   movq    %r8, -40(%rsp)
[0,5]     D=====eER .   .   movq    %r9, -48(%rsp)
[0,6]     .D=====eER.   .   movq    %r10, -56(%rsp)
[0,7]     .D======eER   .   movq    %r11, -64(%rsp)
[0,8]     .D=======eER  .   movq    %r12, -72(%rsp)
[0,9]     .D========eER .   movq    %r13, -80(%rsp)
[0,10]    .D=========eER.   movq    %r14, -88(%rsp)
[0,11]    .D==========eER   movq    %r15, -96(%rsp)
[0,12]    . DeE---------R   subq    $96, %rsp


Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     1     1.0    1.0    0.0       movq   %rax, -8(%rsp)
1.     1     2.0    1.0    0.0       movq   %rbx, -16(%rsp)
2.     1     3.0    1.0    0.0       movq   %rcx, -24(%rsp)
3.     1     4.0    1.0    0.0       movq   %rdx, -32(%rsp)
4.     1     5.0    1.0    0.0       movq   %r8, -40(%rsp)
5.     1     6.0    1.0    0.0       movq   %r9, -48(%rsp)
6.     1     6.0    1.0    0.0       movq   %r10, -56(%rsp)
7.     1     7.0    1.0    0.0       movq   %r11, -64(%rsp)
8.     1     8.0    1.0    0.0       movq   %r12, -72(%rsp)
9.     1     9.0    1.0    0.0       movq   %r13, -80(%rsp)
10.    1     10.0   1.0    0.0       movq   %r14, -88(%rsp)
11.    1     11.0   1.0    0.0       movq   %r15, -96(%rsp)
12.    1     1.0    1.0    9.0       subq   $96, %rsp
       1     5.6    1.0    0.7       <total>

Possible reasons I considered:

  1. Code size: push is more compact (1-2 bytes vs. 5 bytes for a mov store with an RSP-relative offset).
  2. Hardware optimizations: do real CPUs handle push better than llvm-mca's model suggests (probably a stack-engine optimization, but I couldn't find details)?
  3. Compiler heuristics: is push chosen for legacy compatibility or edge cases?
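
The code-size point can be made concrete from the standard x86-64 encodings (the byte comments below are the actual machine code):

```asm
pushq   %rax                # 50               (1 byte)
pushq   %r12                # 41 54            (2 bytes: needs a REX.B prefix)
movq    %rax, -8(%rsp)      # 48 89 44 24 f8   (5 bytes: REX.W + opcode + ModRM + SIB + disp8)
subq    $96, %rsp           # 48 83 ec 60      (4 bytes, paid once per block of stores)
```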
Moi5t, asked Sep 06 '25


1 Answer

llvm-mca's model is wrong. Intel since Pentium-M, and AMD since K10, have a "stack engine" which handles the offsets to RSP from stack ops like push/pop and call/ret, so they can be a single uop with no dependency chain through the stack pointer, RSP. -march=k8 and -march=nocona (Pentium 4) are the only mainstream x86-64 CPUs without a stack engine, and they're so old they're irrelevant for -mtune=generic.

See What is the stack engine in the Sandybridge microarchitecture? and Agner Fog's microarch guide, https://agner.org/optimize/ . For accurate instruction uop counts and throughput/latency, see https://uops.info/. If you click on a number in its table, you can see the loops that were profiled with nanobench to give that result, and the perf counter results. Great resource, thanks to @AndreasAbel for creating it. (although it hasn't been updated for Zen 5 or for Intel newer than Alder Lake :/)

https://uica.uops.info/ is a loop analyzer for Intel SnB-family up to Ice Lake, like LLVM-MCA except accurate. It uses uop data from uops.info, and a model of the front-end issue and uop-to-port allocation patterns that Intel CPUs use, among other things. Andreas wrote a paper about it, showing it predicts performance more accurately than IACA or LLVM-MCA.


With -m32 -mtune=pentium3 you should see GCC avoiding push/pop, including for passing args with -maccumulate-outgoing-args. Maybe also with Clang/LLVM, but it's a newer compiler that never needed to care about tuning for some really old CPUs.

I'm not sure when the Silvermont family got a stack engine, but push is almost certainly efficient on modern E-cores descended from Silvermont. First-gen Silvermont ran push as a single uop for both integer pipes, and pop as 2 uops, so maybe it didn't have a stack engine, but its overall low throughput meant that wasn't a big problem. Goldmont, the second-gen Silvermont family, has single-uop pop reg. And it does limited mov-elimination in rename, so probably it's eliminating RSP updates, too.

Fun fact: Alder Lake P-cores (Golden Cove) eliminate add/sub reg, small-immediate in general, but separate from the stack engine even for RSP so you still need stack-sync uops. For the general mechanism that works on add/sub/inc/dec, no extra sync uop is needed when reading the value, though. (https://chipsandcheese.com/i/149874006/rename-and-allocate-feeding-the-backend talks about it, and has a table for Lion Cove and Redwood Cove vs. Zen 5.)


A stack-sync uop is needed when instructions like mov rbp, rsp follow some stack ops, to update the back-end with the actual RSP value and zero out the offset in the renamer. Any instruction which reads RSP explicitly has this effect when the stack-engine offset is non-zero, like mov eax, [rsp+8]. A sync is also needed if the offset gets too large: apparently it's signed 8-bit in Pentium-M, so if you really ran push in a loop like you're asking llvm-mca to analyze, you'd expect 1 in 32 push instructions to effectively be multi-uop. (That assumes it's still only an 8-bit offset, but that makes sense; most use-cases will use sub rsp, imm at some point, although a long chain of pushes and calls is plausible.)
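A sketch of when the sync fires (the renamer-offset values in the comments are illustrative):

```asm
pushq   %rbx            # stack engine: renamer just records RSP-8, no ALU uop
pushq   %r12            # offset is now -16, still no back-end RSP dependency
movq    %rsp, %rbp      # explicit RSP read: a stack-sync uop is inserted first
                        # to materialize RSP-16 and zero the renamer offset
movq    8(%rsp), %rax   # offset is zero again, so no further sync is needed
```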

On function entry, unless it was a jmp tailcall, the last instruction was call so the stack-engine offset is already non-zero. So there's no downside to using push even if we are going to allocate some stack space later with sub rsp, 24 or something. (In a function that starts with push rbp / mov rbp, rsp, doing more pushes after that will normally require a stack-sync uop at some point, but that's usually fine and worth it for the code-size advantage of push. Function prologues shouldn't run super often; small hot functions should get inlined in most cases, especially with -flto for cross-file inlining.)
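That pattern can be sketched as a typical prologue (a hypothetical function, just to illustrate the point):

```asm
foo:                        # the call that got here left a non-zero stack-engine offset
    pushq   %rbx            # so these pushes add no sync cost on top of it
    pushq   %r12
    subq    $24, %rsp       # reads/writes RSP explicitly: one sync uop here,
                            # which would be needed with or without the pushes
    # ... function body ...
    addq    $24, %rsp
    popq    %r12
    popq    %rbx
    ret
```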

I even used pop instead of lodsd to optimize a code-golf size+performance BigInt Fibonacci problem. I profiled it with performance counters on my i7-6700k Skylake, so I'm 100% certain the stack engine exists and works as intended to make pop fast and single-uop.


The one advantage of sub rsp, imm + mov stores would be smaller stack-unwind metadata: RSP only moves once, so there's only one entry. That means smaller executables and libraries. But it's metadata, separate from .text, so it doesn't even help load times or I-cache / iTLB footprint. Using RBP as a frame pointer also allows smaller stack-unwind metadata, but also isn't worth it when we care about speed.

You're correct that push being a 1 or 2 byte instruction is a nice advantage.

Peter Cordes, answered Sep 07 '25