Perf startup overhead: Why does a simple static executable which performs MOV + SYS_exit have so many stalled cycles (and instructions)?

Question

I'm trying to understand how to measure performance, so I wrote a very simple program:

section .text
    global _start

_start:
    mov rax, 60     ; SYS_exit on x86-64 Linux
    syscall         ; exit status is whatever rdi holds (not set here)

I ran the program with perf stat ./bin. What surprised me was how high stalled-cycles-frontend was:

      0.038132      task-clock (msec)         #    0.148 CPUs utilized          
             0      context-switches          #    0.000 K/sec                  
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.052 M/sec                  
       107,386      cycles                    #    2.816 GHz                    
        81,229      stalled-cycles-frontend   #   75.64% frontend cycles idle   
        47,654      instructions              #    0.44  insn per cycle         
                                              #    1.70  stalled cycles per insn
         8,601      branches                  #  225.559 M/sec                  
           929      branch-misses             #   10.80% of all branches        

   0.000256994 seconds time elapsed

As I understand it, stalled-cycles-frontend means the CPU front end has to wait for the result of some operation (e.g. a bus transaction) to complete.

So what caused the CPU front end to wait most of the time in this simplest of cases?

And why 2 page faults? I don't read any memory pages.

Peter Cordes · Accepted Answer

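Page faults include faults on code pages: even a program that reads no data has to have its own code page mapped in the first time it executes.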
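As a rough illustration that a process accumulates page faults without explicitly reading any data, here is a minimal Python sketch that reports the minor page faults charged to a child process. It is not how perf counts; it reads the kernel's per-process accounting through the standard resource module (Unix only). The helper name child_minor_faults is hypothetical, and the child used here (a Python interpreter running "pass") is far heavier than the two-instruction binary in the question, so the number will be much larger than 2.

```python
import resource
import subprocess
import sys


def child_minor_faults(argv):
    """Run argv to completion and return the minor page faults it incurred.

    Hypothetical helper for illustration. RUSAGE_CHILDREN only covers
    children that have terminated and been waited for, which
    subprocess.run() guarantees.
    """
    before = resource.getrusage(resource.RUSAGE_CHILDREN).ru_minflt
    subprocess.run(argv, check=True)
    after = resource.getrusage(resource.RUSAGE_CHILDREN).ru_minflt
    return after - before


# A child that does no explicit work still has to fault in code and stack pages.
faults = child_minor_faults([sys.executable, "-c", "pass"])
print(f"minor page faults in child: {faults}")
```

The exact count depends on the kernel, the binary, and what is already in the page cache, so treat it as an order-of-magnitude indicator only.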

perf stat includes startup overhead.

I don't know the details of how perf starts counting, but presumably it has to program the performance counters in kernel mode, so they're already counting while the CPU switches back to user mode (stalling for many cycles, especially on a kernel with Meltdown defenses, which invalidate the TLBs).

I guess most of the 47,654 instructions that were recorded were kernel code, perhaps including the page-fault handler!
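To put numbers on that guess, here is a small Python sketch using the counter values from the perf stat output in the question. The figure of 2 user-space instructions is an assumption: it takes the program's own work to be just the mov and the syscall, and attributes everything else to work done outside the program's code.

```python
# Counter values copied from the perf stat output in the question.
cycles = 107_386
stalled_frontend = 81_229
instructions = 47_654

# Assumption: the program itself executes only `mov` and `syscall`.
user_instructions = 2

other_instructions = instructions - user_instructions
user_share = user_instructions / instructions

print(f"instructions not from the program itself: {other_instructions}")
print(f"program's share of counted instructions: {user_share:.4%}")
# These two reproduce the derived columns perf printed (0.44 and 75.64%).
print(f"instructions per cycle: {instructions / cycles:.2f}")
print(f"front-end stalled share of cycles: {stalled_frontend / cycles:.2%}")
```

Under that assumption, 47,652 of the 47,654 counted instructions come from somewhere other than the program's own code, which is why the counts say little about the two instructions being "measured".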

I guess your process never goes user->kernel->user. The whole process is kernel->user->kernel (startup, syscall to invoke sys_exit, then it never returns to user space), so there's never a case where the TLBs would have been hot anyway, except maybe while running inside the kernel after the sys_exit system call. And anyway, TLB misses aren't page faults, but they would explain lots of stalled cycles.

The user->kernel transition itself explains about 150 stalled cycles, BTW. syscall is faster than a cache miss (except it's not pipelined, and in fact flushes the whole pipeline; i.e. the privilege level is not renamed.)
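For scale, a quick Python check of how much of the measured stall count a single transition of that size could account for. The 150-cycle figure is the rough estimate quoted above, not a measurement from this run.

```python
# Stalled front-end cycles reported by perf stat in the question.
stalled_frontend = 81_229

# Rough per-transition estimate quoted in the answer (an approximation).
syscall_stall_estimate = 150

share = syscall_stall_estimate / stalled_frontend
remaining = stalled_frontend - syscall_stall_estimate

print(f"share explained by one user->kernel transition: {share:.2%}")
print(f"stalled cycles left to explain: {remaining}")
```

So the one syscall in the program covers well under 1% of the stalls; the rest has to come from the startup and exit paths discussed above.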

Tags: performance, linux, assembly, x86-64, perf

St.Antario