As was advised a long time ago, I always build my release executables without frame pointers (which is the default if you compile with /Ox).
However, I have now read in the paper http://research.microsoft.com/apps/pubs/default.aspx?id=81176 that frame pointers don't have much of an effect on performance. So optimizing fully (using /Ox) or optimizing fully with frame pointers (using /Ox /Oy-) doesn't really make a difference in performance.
Microsoft seems to indicate that adding frame pointers (/Oy-) makes debugging easier, but is this really the case?
I did some experiments and noticed that:
What is the general advice regarding frame pointers?
Using Visual Studio 2010.
Short answer: By omitting the frame pointer,
You need to use the stack pointer to access local variables and arguments. The compiler doesn't mind, but if you are coding in assembler, this makes your life slightly harder. Much harder if you don't use macros.
You save four bytes (32-bit architecture) of stack space per function call. Unless you are using deep recursion, this isn't a win.
You save a memory write to cached memory (the stack) and you (theoretically) save a few clock ticks on function entry/exit, but you can increase the code size. Unless your function is doing very little very often (in which case it should be inlined), this shouldn't be noticeable.
You free up a general purpose register. If the compiler can utilize the register, it will produce code that is both substantially smaller and potentially faster. But if most of the CPU time is spent talking to the main memory (or even the hard drive), omitting the frame pointer is not going to save you from that.
The debugger will lose an easy way to generate the stack trace. The debugger might still be able to generate the stack trace from a different source (such as a PDB file).
Long answer:
The typical function entry and exit is:
    PUSH FP      ;push the caller's frame pointer
    MOV FP,SP    ;store the stack pointer in the frame pointer
    SUB SP,xx    ;allocate space for local variables et al.
    ...
    LEAVE        ;restore the stack pointer and pop the old frame pointer
    RET          ;return from the function
An entry and exit without a frame pointer could look like:
    SUB SP,xx    ;allocate space for local variables et al.
    ...
    ADD SP,xx    ;de-allocate space for local variables et al.
    RET          ;return from the function.
You will save two instructions, but you also duplicate a literal value, so the code doesn't necessarily get shorter (with a large frame size, quite the opposite). You might have saved a few clock cycles (or not, if it causes a cache miss in the instruction cache). You did save some space on the stack, though.
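To put rough numbers on the code-size point, here is a small sketch in C that counts the entry/exit bytes of both variants. It assumes 32-bit x86 encodings, with EBP and ESP playing the roles of FP and SP above. The function names are made up for illustration, and only the entry and exit sequences are counted.

```c
#include <assert.h>

/* Size in bytes of "SUB ESP, n" or "ADD ESP, n" in 32-bit x86:
   opcode + ModRM + 8-bit immediate (3 bytes) when n fits in a signed
   byte, otherwise opcode + ModRM + 32-bit immediate (6 bytes). */
static unsigned adjust_sp_size(unsigned frame_bytes)
{
    assert(frame_bytes > 0u); /* with no locals, no SUB/ADD is emitted at all */
    return frame_bytes <= 127u ? 3u : 6u;
}

/* Entry/exit with a frame pointer:
   PUSH EBP (1) + MOV EBP,ESP (2) + SUB ESP,n + LEAVE (1) + RET (1). */
unsigned entry_exit_size_with_fp(unsigned frame_bytes)
{
    return 1u + 2u + adjust_sp_size(frame_bytes) + 1u + 1u;
}

/* Entry/exit without a frame pointer:
   SUB ESP,n + ADD ESP,n + RET (1). */
unsigned entry_exit_size_without_fp(unsigned frame_bytes)
{
    return 2u * adjust_sp_size(frame_bytes) + 1u;
}
```

Under these assumptions a 16-byte frame costs 8 bytes with a frame pointer and 7 without, while a 512-byte frame costs 11 bytes with and 13 without. The function body is not counted: in 32-bit mode every ESP-relative memory operand needs an extra SIB byte that an EBP-relative operand does not, which pushes the total further toward the frame-pointer version.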
You do free up a general purpose register. This has only benefits.
In regcall/fastcall, this is one extra register where you can store arguments to your function. Thus, if your function takes seven (on x86; more on most other architectures) or more arguments (including this), the seventh argument still fits into a register. The same, more importantly, applies to local variables as well. Arrays and large objects don't fit into registers (but pointers to them do), but if your function is using seven different local variables (including temporary variables needed to calculate complex expressions), chances are the compiler will be able to produce smaller code. Smaller code means lower instruction cache footprint, which means reduced miss rate and thus even less memory access (but Intel Atom has a 32K instruction cache, meaning that your code will probably fit anyway).
The x86 architecture features the [BX/BP/SI/DI] and [BX/BP + SI/DI] addressing modes. This makes the BP register an extremely useful place for a scaled array index, especially if the array pointer resides in the SI or DI registers. Two offset registers are better than one.
Utilising a register avoids memory access, but if a variable is worth storing in a register, chances are it will survive just as well in the L1 cache (especially since it's going to be on the stack). There is still the cost of moving to/from the cache, but since modern CPUs do a lot of optimisation and parallelisation, it is possible that an L1 access would be just as fast as a register access. Thus, the speed benefit from not moving data around is still present, but not as enormous. I can easily imagine the CPU avoiding the data cache completely, at least as far as reading is concerned (and writing to cache can be done in parallel).
A register that is utilised is a register that needs preserving. It is not worth storing much in the registers if you are going to push it to the stack anyways before you use it again. In preserve-by-caller calling conventions (such as the one above), this means that registers as persistent storage are not as useful in a function that calls other functions a lot.
Also note that x86 has a separate register space for floating point registers, meaning that floats cannot utilise the BP register without extra data movement instructions anyways. Only integers and memory pointers do.
What you do lose by omitting frame pointers is debuggability. This answer shows why:
If the code crashes, all the debugger needs to do to generate the stack trace is:
    PUSH FP        ; log the current frame pointer as well
$1: CALL log_FP    ; log the frame pointer currently on stack
    LEAVE          ; pop the frame pointer to get the next one
    CMP [FP+4],0
    JNZ $1         ; until the stack cannot be popped (the return address is some specific value)
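The same walk can be sketched in C over a simulated stack, which may be easier to follow than the pseudo-assembly. This is only an illustration under stated assumptions: the stack is an array of 32-bit words indexed by word, each frame stores the caller's frame pointer at stack[fp] and the return address at stack[fp + 1], and a return address of 0 marks the outermost frame. The name walk_frame_chain is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Follow the chain of saved frame pointers and collect return addresses.
   Assumed layout: stack[fp] = caller's frame pointer (a word index),
   stack[fp + 1] = return address; return address 0 ends the walk. */
size_t walk_frame_chain(const uint32_t *stack, size_t stack_words,
                        uint32_t fp, uint32_t *out, size_t max_out)
{
    size_t n = 0;
    assert(stack != NULL && out != NULL);
    while (n < max_out && stack_words >= 2 && fp < stack_words - 1) {
        uint32_t ret = stack[fp + 1];
        uint32_t next = stack[fp];
        if (ret == 0)
            break;            /* sentinel: no caller above this frame */
        out[n++] = ret;       /* one line of the stack trace */
        if (next <= fp)
            break;            /* chain must move toward older frames */
        fp = next;
    }
    return n;
}
```

Note that the walker needs nothing beyond the stack contents and the current frame pointer. It never has to know how large any function's frame is, which is what makes this kind of trace cheap for a debugger.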
If the code crashes without a frame pointer, the debugger might have no way to generate the stack trace because it might not know (namely, it needs to locate the function entry/exit point) how much needs to be subtracted from the stack pointer. If the debugger doesn't know the frame pointer is not being used, it might even crash itself.
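For contrast, here is a sketch of what a walk without frame pointers might look like. It assumes the debugger has some side table telling it, for each function's address range, how many words sit between the stack pointer and the return address. The unwind_entry record and walk_with_table are hypothetical and heavily simplified; they are not the actual PDB format, and they assume the stack pointer does not move inside a function body.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical unwind record: for code addresses in [start, end), the
   function keeps frame_words words between SP and its return address. */
typedef struct {
    uint32_t start;
    uint32_t end;
    uint32_t frame_words;
} unwind_entry;

static const unwind_entry *find_entry(const unwind_entry *table,
                                      size_t count, uint32_t pc)
{
    for (size_t i = 0; i < count; i++)
        if (pc >= table[i].start && pc < table[i].end)
            return &table[i];
    return NULL;
}

/* Walk a simulated stack (array of words, sp is a word index) using only
   the table. A return address of 0 marks the outermost frame. */
size_t walk_with_table(const uint32_t *stack, size_t stack_words,
                       uint32_t sp, uint32_t pc,
                       const unwind_entry *table, size_t count,
                       uint32_t *out, size_t max_out)
{
    size_t n = 0;
    assert(stack != NULL && out != NULL);
    while (n < max_out) {
        const unwind_entry *e = find_entry(table, count, pc);
        if (e == NULL)
            break;                      /* no unwind data: trace stops here */
        size_t ret_slot = (size_t)sp + e->frame_words;
        if (ret_slot >= stack_words)
            break;                      /* would read past the stack */
        pc = stack[ret_slot];
        if (pc == 0)
            break;                      /* sentinel: outermost frame */
        out[n++] = pc;
        sp = (uint32_t)(ret_slot + 1);  /* step over the return address */
    }
    return n;
}
```

If the table has no entry for the current code address, the walk stops immediately, which mirrors the situation described above: without frame pointers and without that extra information, the debugger cannot tell how far to move the stack pointer.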