
Cost of push vs. mov (stack vs. near memory), and the overhead of function calls

Question:

Is accessing the stack the same speed as accessing memory?

For example, I could choose to do some work within the stack, or I could do work directly with a labelled location in memory.

So, specifically: is push ax the same speed as mov [bx], ax? Likewise, is pop ax the same speed as mov ax, [bx]? (Assume bx holds a location in near memory.)

Motivation for Question:

It is common in C to discourage trivial functions that take parameters.

I've always thought this is not only because the parameters must be pushed onto the stack and then removed once the function returns, but also because the function call itself must preserve the CPU's context, which means more stack usage.

But assuming one knows the answer to the headlined question, it should be possible to quantify the overhead that the function uses to set itself up (push / pop / preserve context, etc.) in terms of an equivalent number of direct memory accesses. Hence the headlined question.
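To make that concrete, here is a rough sketch (NASM syntax, 16-bit, assuming a typical small-model C calling convention; the names f and a are hypothetical) of the stack traffic behind a single one-parameter call like x = f(a):

    ; caller side
            push    word [a]        ; pass the parameter       (stack write)
            call    f               ; pushes return address    (stack write)
            add     sp, 2           ; caller removes parameter

    ; callee side
    f:      push    bp              ; save caller's frame ptr  (stack write)
            mov     bp, sp          ; establish frame pointer
            mov     ax, [bp+4]      ; read the parameter       (stack read)
            ; ... function body ...
            pop     bp              ; restore frame pointer    (stack read)
            ret                     ; pops return address      (stack read)

Counting the marked lines, even an empty function costs roughly six stack accesses plus the two control transfers, which is exactly the overhead to weigh against plain mov instructions.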


(Edit/clarification: near above is meant as opposed to far in the segmented memory model of the 16-bit x86 architecture.)
Asked Oct 07 '12 by Assad Ebrahim

2 Answers

Nowadays your C compiler can outsmart you. It may inline simple functions, and when it does there is no function call or return at all, and possibly no additional stack manipulation for passing and accessing the formal parameters, provided everything can be kept in registers (when the function is inlined but the available registers are exhausted, some equivalent memory traffic may remain). Better yet, if the result is a constant value, the compiler can see that and take advantage of it.
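As a hedged illustration (the function add3, the variables v and y, and the emitted code are all hypothetical; real output depends entirely on the compiler and its options), inlining can collapse the whole call sequence into a few register operations, or even a single constant store:

    ; without inlining, y = add3(v); for int add3(int x) { return x + 3; }
    ; might cost the full push/call/prologue/ret sequence:
            push    word [v]
            call    add3
            add     sp, 2
            mov     [y], ax

    ; with inlining, the same statement may become:
            mov     ax, [v]
            add     ax, 3
            mov     [y], ax

    ; and if v is known at compile time (say v = 2), constant folding
    ; can reduce it to a single store:
            mov     word [y], 5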

Function calls themselves can be relatively cheap (though not necessarily zero-cost) on modern CPUs, especially when repeated, because a separate instruction cache and the various prediction mechanisms (branch and return-address prediction) help the call and return execute efficiently.

Other than that, I'd expect the performance implications of the choice "local var vs. global var" to depend on the memory usage patterns. If the CPU has a data cache, the top of the stack is likely to stay in that cache, unless you allocate and deallocate large arrays or structures on it, or have deep function calls or deep recursion, causing cache misses. If the global variable of interest is accessed often, or its neighbors are, I'd expect that variable to be in the cache most of the time as well. Either way, if you're accessing spans of memory too large to fit in the cache, you'll take cache misses and likely lose performance, unless there happens to be a more cache-friendly way of doing what you want to do.

If the hardware is fairly dumb (no or tiny caches, no prediction, no instruction reordering, no speculative execution, nothing), clearly you want to reduce memory pressure and the number of function calls, because each and every one will count.

Yet another factor is instruction length and decoding. Instructions to access an on-stack location (relative to the stack pointer) can be shorter than instructions to access an arbitrary memory location at a given address. Shorter instructions may be decoded and executed faster.
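For example (byte counts follow from the standard x86 instruction encoding; the first group assumes 16-bit code, the second 64-bit code, and sym is a placeholder label):

    ; 16-bit mode: stack and direct addressing are close in size
            push    ax              ; 50                 -- 1 byte
            mov     [bx], ax        ; 89 07              -- 2 bytes
            mov     ax, [bp-2]      ; 8B 46 FE           -- 3 bytes (frame-relative)
            mov     ax, [0x1234]    ; A1 34 12           -- 3 bytes (accumulator form)
            mov     cx, [0x1234]    ; 8B 0E 34 12        -- 4 bytes (other registers)

    ; 64-bit mode: the gap widens
            mov     eax, [rbp-8]    ; 8B 45 F8           -- 3 bytes
            mov     eax, [rsp+8]    ; 8B 44 24 08        -- 4 bytes (needs a SIB byte)
            mov     eax, [rel sym]  ; 8B 05 xx xx xx xx  -- 6 bytes (RIP-relative)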

I'd say there's no definitive answer for all cases because performance depends on:

  • your hardware
  • your compiler
  • your program and its memory access patterns
Answered by Alexey Frunze


For the clock-cycle-curious...

For those who would like to see specific clock cycles, instruction / latency tables for a variety of modern x86 and x86-64 CPUs are available here (thanks to hirschhornsalz for pointing these out).

You then get, on a Pentium 4 chip:

  • push ax and mov [bx], ax (boxed in red in the table image below) are virtually identical in efficiency, with identical latencies and throughputs.
  • pop ax and mov ax, [bx] (boxed in blue) are similarly efficient, with identical throughputs, even though mov ax, [bx] has twice the latency of pop ax.

[Image: Pentium 4 instruction timing table]

As for the follow-on question in the comments (third comment):

  • indirect addressing (i.e. mov [bx], ax) is not materially different from direct addressing (i.e. mov [loc], ax), where loc is an assembler constant naming a fixed address, e.g. loc equ 0xfffd (see the sketch below).
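A minimal sketch of the two forms side by side (loc as defined above):

    loc     equ     0xfffd          ; assembler constant naming an address

            mov     bx, loc         ; put the address in a register
            mov     [bx], ax        ; indirect: address taken from bx at run time
            mov     [loc], ax       ; direct: address encoded in the instruction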

Conclusion: Combine this with Alexey's thorough answer, and there's a pretty solid case for the efficiency of using the stack and letting the compiler decide when a function should be inlined.

(Side note: In fact, even as far back as the 8086 from 1978, using the stack was no less efficient than the corresponding movs to memory, as can be seen from these old 8086 instruction timing tables.)


Understanding Latency & Throughput

A bit more background may be needed to make sense of the timing tables for modern CPUs. These should help:

  • definitions of latency (the cycles before an instruction's result is available) and throughput (how often a new instance of that instruction can be started)
  • a useful analogy for latency and throughput, and their relation to instruction processing pipelines
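As a small hedged illustration (actual cycle counts depend on the CPU): dependent instructions are paced by latency, independent ones by (reciprocal) throughput:

    ; dependent chain: each add needs the previous result,
    ; so results arrive one per LATENCY cycles
            add     ax, ax
            add     ax, ax
            add     ax, ax

    ; independent operations: the pipeline can overlap these,
    ; so they complete one per THROUGHPUT cycles
            add     ax, 1
            add     bx, 1
            add     cx, 1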
Answered by Assad Ebrahim