Do functions in our code make it slower?

When our code is compiled and executed, the assembly it gets converted into stores functions in a non-sequential manner. So every time a function is called, the processor needs to throw away the instructions in the pipeline. Doesn't this affect the performance of the program?

PS: I'm not considering the time invested in developing such programs without functions; this is purely about performance. Are there any ways in which compilers deal with this to reduce it?

asked Mar 12 '23 by ayushgp

1 Answer

So every time a function is called, the processor needs to throw away the instructions in the pipeline.

No, everything after the decode stage is still good. The CPU knows not to keep decoding after an unconditional branch (like a jmp, call, or ret). Only the instructions that have been fetched but not yet decoded are ones that shouldn't run. Until the target address is decoded from the instruction, there's nothing useful for the beginning of the pipeline to do, so you get bubbles in the pipeline until the target address is known. Decoding branch instructions as early as possible thus minimizes the penalty for taken branches.

In the classic RISC pipeline, the stages are IF ID EX MEM WB (fetch, decode, execute, memory access, write-back of results to registers). So when ID decodes a branch instruction, the pipeline throws away the instruction currently being fetched in IF, and the instruction currently being decoded in ID (because it's the instruction after the branch).

"Hazard" is the term for things that prevent a steady stream of instructions from going through the pipeline at one per clock. Branches are a Control Hazard. (Control as in flow-control, as opposed to data.)

If the branch target isn't in L1 I-cache, the pipeline will have to wait for instructions to stream in from memory before the IF pipeline stage can produce a fetched instruction. I-cache misses always create a pipeline bubble. Prefetching usually avoids this for non-branching code.


More complex CPUs decode far enough ahead to detect branches and re-steer fetch soon enough to hide this bubble. This may involve a queue of decoded instructions to hide the fetch bubble.

Also, instead of actually decoding to detect branch instructions, the CPU can check every instruction address against a "Branch Target Buffer" cache. If you get a hit, you know the instruction is a branch even though you haven't decoded it yet. The BTB also holds the target address, so you can start fetching from there right away (if it's an unconditional branch or your CPU supports speculative execution based on branch prediction).
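Very roughly, you can picture a simple direct-mapped BTB like the toy sketch below. The sizes, fields, and indexing here are made up purely for illustration; real designs differ in the details and are integrated with the rest of the branch predictor.

    /* Toy sketch of a direct-mapped Branch Target Buffer (BTB).
     * Sizes, fields and indexing are invented for illustration only. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define BTB_ENTRIES 64              /* assumed size, power of two for cheap indexing */

    struct btb_entry {
        uint64_t tag;                   /* fetch address that owns this slot */
        uint64_t target;                /* predicted branch target */
        bool     valid;
    };

    static struct btb_entry btb[BTB_ENTRIES];

    /* Front-end lookup: is this fetch address a known branch, and where does it go? */
    static bool btb_lookup(uint64_t fetch_addr, uint64_t *target)
    {
        struct btb_entry *e = &btb[fetch_addr % BTB_ENTRIES];
        if (e->valid && e->tag == fetch_addr) {
            *target = e->target;        /* re-steer fetch here right away */
            return true;
        }
        return false;                   /* no hit: keep fetching sequentially */
    }

    /* After the branch is actually decoded/executed: remember it for next time. */
    static void btb_update(uint64_t fetch_addr, uint64_t target)
    {
        struct btb_entry *e = &btb[fetch_addr % BTB_ENTRIES];
        e->tag = fetch_addr;
        e->target = target;
        e->valid = true;
    }

    int main(void)
    {
        uint64_t target;
        btb_update(0x401000, 0x401080);     /* learned: the insn at 0x401000 is a branch */
        if (btb_lookup(0x401000, &target))  /* next time that address is fetched */
            printf("re-steer fetch to %#llx\n", (unsigned long long)target);
        return 0;
    }

The point is that the lookup needs nothing but the fetch address, so it can happen before (or in parallel with) decode.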


ret is actually the harder case: the return address is in a register or on the stack, not encoded directly into the instruction. It's an unconditional indirect branch. Modern x86 CPUs maintain an internal return-address predictor stack, and perform very badly when you mismatch call/ret instructions. E.g. call label / label: pop ebx is terrible for position-independent 32-bit code to get EIP into EBX. That will cause a mis-predict for the next 15 or so rets up the call tree.

I think I've read that a return-address predictor stack is used by some other non-x86 microarchitectures.
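To see why a mismatched call/ret throws the predictor off, here's a toy model of a return-address stack. The depth and overflow behaviour are invented for the sketch; real predictors differ in detail.

    /* Toy sketch of a return-address predictor stack (RAS).
     * Depth and overflow behaviour are invented for illustration. */
    #include <stdint.h>
    #include <stdio.h>

    #define RAS_DEPTH 16                /* assumed depth, roughly the ballpark of real CPUs */

    static uint64_t ras[RAS_DEPTH];
    static unsigned ras_top;            /* old entries simply get overwritten on overflow */

    /* A call pushes the address of the instruction after the call. */
    static void on_call(uint64_t return_addr)
    {
        ras[ras_top++ % RAS_DEPTH] = return_addr;
    }

    /* A ret predicts the most recently pushed address. */
    static uint64_t on_ret(void)
    {
        return ras[--ras_top % RAS_DEPTH];
    }

    int main(void)
    {
        on_call(0x1000);                /* outer caller's call into this function */
        on_call(0x2000);                /* "call label" used only to read EIP ...          */
                                        /* ... but the matching pop (instead of ret) never */
                                        /* pops the RAS, so the predictor stays off by one */
        printf("predicted return: %#llx (actual return goes to 0x1000)\n",
               (unsigned long long)on_ret());
        return 0;
    }

Every real ret after the mismatch pops a stale entry, which is why the mis-predicts cascade up the call tree until the stack contents are used up.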

See Agner Fog's microarchitecture pdf to learn more about how x86 CPUs behave (also see the x86 tag wiki), or read a computer architecture textbook to learn about simple RISC pipelines.

For more about caches and memory (mostly focused on data caching / prefetching), see Ulrich Drepper's What Every Programmer Should Know About Memory.


An unconditional branch is quite cheap, usually only a couple of cycles at worst (not counting I-cache misses).

The big cost of a function call is when the compiler can't see the definition of the target function and has to assume it clobbers all the call-clobbered registers in the calling convention. (In x86-64 System V, that's all the float/vector registers and about 8 integer registers.) This requires either spilling live data to memory or keeping it in call-preserved registers; but using call-preserved registers means the function has to save/restore them so it doesn't break its own caller.
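As a rough sketch of what that looks like from the optimizer's point of view (ext() is a made-up opaque function, and exactly what gets spilled or saved depends on the target and optimization level):

    /* ext() is only declared here, not defined, so the compiler has to assume it
     * clobbers every call-clobbered register in the calling convention.
     * (This won't link on its own; it's just to show what the optimizer sees.) */
    int ext(int);

    int keep_live(int a, int b)
    {
        int t = a * b;      /* t is live across the call to ext() */
        int r = ext(a);     /* the compiler must protect t: either spill it to the
                               stack, or hold it in a call-preserved register
                               (which this function then has to save and restore). */
        return r + t;
    }

If ext() were defined in the same compilation unit (or visible through link-time optimization), the compiler could inline it or at least know which registers it actually touches, and t could often just stay in a register across the call.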

Inter-procedural optimization to let functions take advantage of knowing which registers other functions actually clobber, and which they don't, is something compilers can do within the same compilation unit. Or even across compilation units with link-time whole-program optimization. But it can't extend across dynamic-linking boundaries, because the compiler isn't allowed to make code that will break with a differently-compiled version of the same shared library.


Are there any ways in which compilers deal with this to reduce it?

They inline small functions, or even large static functions that are only called once.

e.g.

int foo(void) { return 1; }
    mov     eax, 1    #,
    ret

int bar(int x) { return foo() + x; }
    lea     eax, [rdi+1]      # D.2839,
    ret
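(Output in that style, Intel syntax with # comments like D.2839, is what something like gcc -O2 -S -masm=intel -fverbose-asm produces; the exact flags and temporary names for this listing are a guess.)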

As @harold points out, overdoing it with inlining can cause cache misses, too, because it inflates your code size so much that not all of your hot code fits in cache.

Intel SnB-family designs have a small but very fast uop cache that caches decoded instructions. It only holds at most 1536 uops IIRC, in lines of 6 uops each. Executing from uop cache instead of from the decoders shortens the branch-mispredict penalty from 19 to 15 cycles, IIRC (something like that, but those numbers are probably not actually correct for any specific uarch). There's also a significant frontend throughput boost compared to the decoders, esp. for long instructions which are common in vector code.

answered Mar 19 '23 by Peter Cordes