I'm trying to wrap my head around the x86 instruction encoding format. All the sources I've read still leave the subject confusing. I'm starting to understand it a little bit, but one thing I'm having trouble understanding is how the CPU instruction decoder differentiates an opcode prefix from an opcode.
I'm aware that the whole format of the instruction basically depends on the opcode (with extra bit fields defined in the opcode of course). Sometimes the instruction doesn't have a prefix and the opcode is the first byte. How would the decoder know?
I'm assuming that the instruction decoder can tell the difference because opcode bytes and prefix bytes don't share the same binary values, so the decoder can tell whether the value of a given byte is an opcode or a prefix. For example (sticking to single-byte opcodes), a REX or LOCK prefix would not share the same byte value as any opcode in the architecture's instruction set.
The instruction decoder translates the value in the instruction register (IR) into a set of control signals. Those signals determine which operation is performed, where the source operands are located in the register file (RF), and where the destination operand that receives the result is located.
Instruction prefixes are used to modify the following instruction. They are used to repeat string instructions, to provide segment overrides, to perform bus lock operations, and to change operand and address sizes.
Operand-Size and Address-Size Instruction Prefixes: The internal encoding of an instruction can include two byte-long prefixes: the address-size prefix, 67H, and the operand-size prefix, 66H. (A later section, "Instruction Format," shows the position of the prefixes in an instruction's encoding.)
In computing, an opcode (abbreviated from operation code, also known as instruction machine code, instruction code, instruction syllable, instruction parcel or opstring) is the portion of a machine language instruction that specifies the operation to be performed.
Traditional (single-byte) prefixes are different from opcode bytes like you said, so a state machine can just remember which prefixes it's seen until it gets to an opcode byte.
The 0f escape byte for 2-byte opcodes is not really a prefix. It has to be contiguous with the 2nd opcode byte. Thus, following a 0f, any byte is an opcode, even if it's something like f2 that would otherwise be a prefix. (This also applies following the 0f 3a or 0f 38 2-byte escapes for SSSE3 and later, or VEX/EVEX prefixes that encode one of those escape sequences.)
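As a rough sketch of that state-machine idea (a Python toy, not a real decoder: the LEGACY_PREFIXES set and split_prefixes helper are my own names, and it ignores REX/VEX/EVEX and everything after the opcode):

# Legacy one-byte prefixes: lock, repne/rep, segment overrides, operand/address size.
LEGACY_PREFIXES = {0xF0, 0xF2, 0xF3,
                   0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,
                   0x66, 0x67}

def split_prefixes(insn):
    """Split leading prefixes and the opcode bytes off an instruction (illustration only)."""
    i = 0
    while insn[i] in LEGACY_PREFIXES:   # remember prefixes until a non-prefix byte shows up
        i += 1
    prefixes = insn[:i]
    if insn[i] == 0x0F:                 # escape byte: whatever follows is opcode, never a prefix
        n = 3 if insn[i + 1] in (0x38, 0x3A) else 2
    else:
        n = 1                           # plain one-byte opcode
    return prefixes, insn[i:i + n]

# f3 0f b8 c1 is popcnt eax, ecx: f3 is a (mandatory) prefix, 0f b8 is the opcode.
print(split_prefixes(bytes.fromhex("f30fb8c1")))    # (b'\xf3', b'\x0f\xb8')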
If you look at an opcode map, there are no entries that are ambiguous between single-byte prefix and opcode. (e.g. http://ref.x86asm.net/coder64.html, and notice how the 2-byte 0F .. opcodes are listed separately).
The decoders do have to know the current mode for this (and other things); for example x86-64 removed the 1-byte inc/dec reg opcodes for use as REX prefixes (x86 32 bit opcodes that differ in x86-x64 or entirely removed). We can even use this difference to write polyglot machine code that runs differently when decoded in 32-bit vs. 64-bit mode, or even distinguish all 3 mode sizes.
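For instance (a sketch assuming the capstone disassembler's Python bindings are available; the byte sequence is just one I picked), 48 ff c0 is dec eax / inc eax in 32-bit mode but a REX.W prefix plus inc rax in 64-bit mode:

from capstone import Cs, CS_ARCH_X86, CS_MODE_32, CS_MODE_64

code = bytes.fromhex("48ffc0")   # 48 = dec eax in 32-bit mode, but a REX.W prefix in 64-bit mode
for name, mode in (("32-bit", CS_MODE_32), ("64-bit", CS_MODE_64)):
    md = Cs(CS_ARCH_X86, mode)
    insns = [f"{i.mnemonic} {i.op_str}".strip() for i in md.disasm(code, 0)]
    print(name, "->", " ; ".join(insns))
# 32-bit -> dec eax ; inc eax
# 64-bit -> inc rax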
x86 machine code is a byte stream that's not self-synchronizing (e.g. a ModRM or an immediate can be any byte). The CPU always knows where to start decoding from, either a jump target or the byte after the end of a previous instruction. That's the start of the instruction (including prefixes).
Bytes in memory are just bytes, only becoming instructions when they're decoded by the CPU. (Although in normal programs, simply disassembling from the top of the .text section does give you the program's instructions. Self-modifying and obfuscated code are not normal.)
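To make "not self-synchronizing" concrete, here's a small sketch (capstone again, with an arbitrary byte string of my choosing): the same bytes decode to completely different instructions depending on where decoding starts.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

code = bytes.fromhex("b801000000c3")   # from its real start: mov eax, 1 ; ret
md = Cs(CS_ARCH_X86, CS_MODE_64)
for start in (0, 1):                   # decode from the real start, then from 1 byte in
    insns = [f"{i.mnemonic} {i.op_str}".strip() for i in md.disasm(code[start:], start)]
    print(f"start at byte {start}: " + " ; ".join(insns))
# start at byte 0: mov eax, 1 ; ret
# start at byte 1: add dword ptr [rax], eax ; add byte ptr [rax], al ; ret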
Multi-byte VEX and EVEX prefixes aren't that simple in 32-bit mode. For example, VEX prefixes overlap with invalid encodings of LES and LDS in modes other than 64-bit. (The c4 and c5 opcodes for LES and LDS are always invalid in 64-bit mode, except as VEX prefixes.) https://wiki.osdev.org/X86-64_Instruction_Encoding#VEX.2FXOP_opcodes
In legacy / compat modes, there weren't any free bytes left that weren't already opcodes or prefixes when AVX (VEX prefixes) and AVX-512 (EVEX prefix) were designed, so the only room for extensions was as encodings of opcodes that are only valid with a limited set of ModRM bytes. (e.g. LES / LDS require a memory source, not a register - this is why some bits are inverted in VEX prefixes, so the top 2 bits of the byte after c4 or c5 will always be 1 in 32-bit mode instead of 0. That's the "mod" field in ModRM, and 11 means register.)
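A concrete case of that inversion (a small illustration; the instruction choice is mine): c5 f8 77 is the VEX encoding of vzeroupper, and the byte after c5 has its top two bits set, which read as a ModRM byte would be mod=11, a register operand that LDS never allows.

# c5 f8 77 = vzeroupper (2-byte VEX prefix c5 f8, then opcode 77).
second_byte = 0xF8              # the byte after c5: ~R, ~vvvv, L, pp fields
mod_if_lds = second_byte >> 6   # the same two bits read as a ModRM "mod" field for LDS
print(f"{second_byte:08b} -> mod = {mod_if_lds:02b}")   # 11111000 -> mod = 11
# mod == 11 would mean a register source, which LDS never allows, so in 32-bit
# mode the decoder knows a c5 followed by this byte must start a VEX prefix.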
(Fun fact: VEX prefixes are not recognized in 16-bit real mode, apparently because some software used the same invalid encodings of LES / LDS as intentional traps, to be sorted out in the #UD exception handler. VEX prefixes are recognized in 16-bit protected mode, though.)
AMD64 freed up several bytes by removing instructions like AAM, as well as LES/LDS (and the one-byte inc/dec reg encodings, for use as REX prefixes), but CPU vendors have continued to care about 32-bit mode and have not added any extensions that are only available in 64-bit mode and could simply take advantage of those free opcode bytes. This means finding ways to cram new instruction encodings into increasingly small gaps in 32-bit machine code. (Often via mandatory prefixes, e.g. rep bsr = lzcnt on CPUs with that feature, which gives different results.)
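For example, f3 0f bd c0 is lzcnt eax, eax on CPUs with LZCNT, while an older CPU ignores the f3 and runs it as bsr eax, eax; here's a quick Python sketch of the two semantics (the input value is arbitrary):

x = 0x10                            # bit 4 is the highest set bit (and x is nonzero)
bsr_result = x.bit_length() - 1     # bsr: index of the highest set bit -> 4
lzcnt_result = 32 - x.bit_length()  # lzcnt: leading-zero count of a 32-bit value -> 27
print(hex(x), "bsr =", bsr_result, "lzcnt =", lzcnt_result)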
So the decoders in modern CPUs that support AVX / BMI1/2 have to look at multiple bytes to decide whether this is a prefix for a valid AVX or other VEX-encoded instruction, or in 32-bit mode if it should decode as LES or LDS. (And I guess look at the rest of the instruction to decide if it should #UD).
But modern CPUs are looking at 16 or 32 bytes at a time anyway to find instruction boundaries in parallel. (And then later feed those groups of instruction bytes to actual decoders, again in parallel.) https://www.realworldtech.com/sandy-bridge/4/
Same goes for the prefix scheme used by AMD XOP, which is a lot like VEX.
Agner Fog's blog article Stop the instruction set war from 2009 (soon after AVX was announced, before the first hardware supporting it) has a table of remaining unused coding space for future extensions, and some notes about it being "assigned" to AMD, Intel, or Via.
(mov r64, imm64 for example.) (This is not really related to prefixes, but in general seeing how the rules apply to weird cases can help understand exactly how things work.)
A software disassembler does need to know a start point. This can be problematic if obfuscated code mixes code and data, and actual execution jumps to places you wouldn't get if you just assume that you can decode in order without following jumps.
Fortunately compiler-generated code doesn't do that, so naive static disassembly (e.g. by objdump -d or ndisasm, as opposed to IDA) finds the same instruction boundaries that actually running the program will.
This is not a problem for running obfuscated machine code; the CPU just does what it's told, and never cares about bytes before the place you tell it to jump to. Disassembling without running / single-stepping the program is the hard thing, especially with the possibility of self-modifying code and jumps to what a naive disassembler would think was the middle of an earlier instruction.
Obfuscated machine code can even have an instruction decode one way, then jump back into what was the middle of that instruction, for a later byte to be the opcode (or prefix + opcode). Modern CPUs with uop caches or that mark instruction boundaries in I-cache run slow (but correctly) if you do this, so it's more of a fun code-golf trick (extreme code-size optimization at the expense of speed) or obfuscation technique.
For an example of this, see my codegolf.SE x86 machine code answer to Golf a Custom Fibonacci Sequence. I'll excerpt the disassembly that lines up with what the CPU sees after looping back to cfib.loop, but note that the first iteration decodes differently. So I'm using just 1 byte outside the loop instead of 2 to effectively jump into the middle for the start of the first iteration. See the linked answer for a full description and the other disassembly.
0000000000401070 <cfib>:
401070: eb .byte 0xeb # jmp rel8 consuming the 01 add opcode as a rel8
0000000000401071 <cfib.loop>:
401071: 01 d0 add eax,edx
# entry point on the first iteration: the jmp above skips the D0 ModRM byte of the ADD, landing here
401073: 92 xchg edx,eax
401074: e2 fb loop 401071 <cfib.loop>
401076: c3 ret
You can do this with opcodes that consume more later bytes, like 3D <dword> (cmp eax, imm32). When the CPU sees a 3D opcode byte, it will grab the next 4 bytes as the immediate. If you later jump into those 4 bytes, they'll be considered as prefixes/opcodes, and everything will work the same (except for performance problems) regardless of how those bytes had previously been decoded as a different part of an instruction. The CPU has to maintain the illusion of decoding and executing 1 instruction at a time, other than performance.
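Here's a sketch of that trick (capstone again; the bytes are a toy example of mine): decoded from the 3D opcode, the 4 immediate bytes are just data, but decoded from one byte later they are nop / nop / xor eax,eax.

from capstone import Cs, CS_ARCH_X86, CS_MODE_64

code = bytes.fromhex("3d909031c0c3")   # 3d + 4 immediate bytes, then c3 ret
md = Cs(CS_ARCH_X86, CS_MODE_64)
for start in (0, 1):
    insns = [f"{i.mnemonic} {i.op_str}".strip() for i in md.disasm(code[start:], start)]
    print(f"decode from offset {start}: " + " ; ".join(insns))
# decode from offset 0: cmp eax, 0xc0319090 ; ret
# decode from offset 1: nop ; nop ; xor eax, eax ; ret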
I learned of this trick from @Ira Baxter's answer on Can assembled ASM code result in more than a single possible way (except for offset values)?