While debugging some software I noticed that INT3 instructions are inserted between subroutines in many cases.
I assume these are not technically inserted 'between' functions, but rather after them, in order to pause execution if a subroutine does not execute retn at the end for whatever reason.
Are my assumptions correct? If not, what's the purpose of these instructions?
On Linux, gcc and clang pad with 0x90 (NOP) to align functions. (Even the linker does this when linking .o files with sections of uneven size.)
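A minimal sketch (NASM syntax) of what that padding looks like; foo and bar are hypothetical functions, not anything from the question:

```asm
foo:
    mov     eax, 1
    ret
    nop                 ; 0x90 padding bytes: never reached by straight-line
    nop                 ; execution, only there so the next function starts
    nop                 ; on an aligned boundary
bar:
    xor     eax, eax
    ret
```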
There's not usually any particular advantage, except maybe when the CPU has no branch-prediction for the RET instruction at the end of a function. In that case, NOP doesn't get the CPU started on anything that takes time to recover from when the correct branch target is discovered.
The last instruction of a function might not be a RET; it might be an indirect JMP (e.g. a tail-call through a function pointer). In that case, branch prediction is more likely to fail. (CALL/RET pairs are specially predicted by a return stack. Note that RET is an indirect JMP in disguise; it's basically a jmp [rsp] plus an add rsp, 8.)
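As a rough sketch of that equivalence (illustration only: a real RET does the load, the stack adjustment, and the jump as a single instruction, and doesn't clobber a register):

```asm
bits 64
; Pseudo-equivalent of a 64-bit near RET
    mov     rax, [rsp]      ; load the return address that CALL pushed
    add     rsp, 8          ; pop it off the stack
    jmp     rax             ; indirect jump to that address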
The default prediction for an indirect JMP or CALL (when no Branch Target Buffer prediction is available) is to jump to the next instruction. (Apparently making no prediction and stalling until the correct target is known is either not an option, or the default prediction is usable enough for jump tables.)
If the default prediction leads to speculatively executing something that the CPU can't abort easily, like an FP sqrt or maybe something microcoded, this increases the branch misprediction penalty. Even worse if the speculatively-executed instruction causes a TLB miss, triggering a hardware pagewalk, or otherwise pollutes the cache.
An instruction like INT 3 that just generates an exception can't have any of these problems. The CPU won't try to execute the INT before it should, so nothing bad will happen. IIRC, it's recommended to place something like that after an indirect JMP if the next-instruction default-prediction isn't useful.
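For instance, something like this (a sketch in NASM syntax; func_ptr and the labels are hypothetical names, just to show the layout):

```asm
bits 64
section .data
func_ptr:   dq target               ; hypothetical function pointer

section .text
target:
    ret

tail_call:
    jmp     qword [rel func_ptr]    ; indirect tail-call: real target unknown until execution
    int3                            ; never reached architecturally; if the fall-through
                                    ; default prediction runs ahead, it only hits a trap
```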
With random garbage between functions, even pre-decoding the 16B block of machine code that includes the RET could be slowed down. Modern CPUs decode in parallel in groups of 4 instructions, so they can't detect a RET until after the following instructions are already decoded. (This is different from speculative execution.) It's useful to avoid slow-to-decode Length-Changing Prefixes (LCPs) in the bytes after an unconditional branch (like RET), since that would delay decoding of the branch.
LCP stalls only affect Intel CPUs: AMD marks instruction boundaries in their L1 cache, and decodes in larger groups. (Intel uses a decoded-uop cache to get high throughput without the power cost of actually decoding every time in a loop.)
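For reference, this is the kind of instruction that counts as having an LCP (a small sketch, not taken from the answer above):

```asm
bits 64
    add     ax, 300         ; 0x66 operand-size prefix + 16-bit immediate:
                            ; the classic LCP case that stalls Intel pre-decode
    add     eax, 300        ; 32-bit immediate, no prefix: no LCP stall
```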
Note that in Intel CPUs, instruction-length finding happens in an earlier stage than actual decoding. For example, the Sandybridge frontend looks like this:
[Diagram of the Sandybridge front-end, copied from David Kanter's Haswell write-up. I linked to his Sandybridge write-up, though; they're both excellent.]
See also Agner Fog's microarch pdf, and more links in the x86 tag wiki, for the details on what I described in this answer (and much more).
Incorrect assumptions.
They're padding between functions, not after. And a CPU that randomly decides to skip instructions is broken and should be thrown away.
The reason for INT 3 is twofold. It's a single-byte instruction, which means you can use it even if there's just a single byte of space; the vast majority of instructions are unsuitable because they're too long. Furthermore, it's the "debug break" instruction, which means a debugger can catch any attempt to execute code between functions. That's not caused by ignoring retn, but by simpler mistakes such as calling through an uninitialized function pointer.
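In terms of layout, it looks something like this (a minimal sketch in NASM syntax; funcA and funcB are hypothetical functions):

```asm
funcA:
    mov     eax, 1
    ret
    int3                ; single byte (0xCC), so it fits even a one-byte gap;
                        ; executing it (e.g. via a garbage function pointer)
                        ; raises #BP and the debugger breaks right here
funcB:
    xor     eax, eax
    ret
```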