How does an instruction decoder tell the difference between a prefix and a primary opcode?

I'm trying to wrap my head around the x86 instruction encoding format. All the sources I've read still leave the subject confusing. I'm starting to understand it a little, but one thing I'm having trouble understanding is how the CPU's instruction decoder differentiates an opcode prefix from an opcode.

I'm aware that the whole format of the instruction basically depends on the opcode (with extra bit fields defined in the opcode of course). Sometimes the instruction doesn't have a prefix and the opcode is the first byte. How would the decoder know?

I'm assuming that the instruction decoder can tell the difference because opcode bytes and prefix bytes don't share the same binary values, so the decoder can tell from the value of the byte whether it is an instruction or a prefix. For example (sticking to single-byte opcodes), a REX or LOCK prefix would not share the same byte value as any opcode in the architecture's instruction set.

342

asked Aug 23 '21 20:08

Daniel Catalano

1 Answer

Traditional (single-byte) prefixes are different from opcode bytes like you said, so a state machine can just remember which prefixes it's seen until it gets to an opcode byte.

The 0f escape byte for 2-byte opcodes is not really a prefix. It has to be contiguous with the 2nd opcode byte. Thus, following a 0f, any byte is an opcode, even if it's something like f2 that would otherwise be a prefix. (This also applies following 0f 3a or 0f 38 2-byte escapes for SSSE3 and later, or VEX/EVEX prefixes that encode one of those escape sequences).
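As a rough illustration of that state machine, here is a minimal Python sketch. The names `LEGACY_PREFIXES` and `split_prefixes` are made up for this example, and the model is deliberately incomplete: it ignores REX, VEX and EVEX, does not decode ModRM or immediates, and assumes the input holds at least one complete instruction.

```python
# Legacy single-byte prefixes; these byte values are never opcodes on their own.
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,                    # LOCK, REPNE, REP/REPE
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,  # segment overrides
    0x66, 0x67,                          # operand-size, address-size
}

def split_prefixes(code: bytes, start: int = 0):
    """Return (prefix_list, opcode_bytes) for the instruction at `start`.

    Toy model only: no REX/VEX/EVEX, no ModRM or immediate decoding.
    """
    i = start
    prefixes = []
    # State 1: keep consuming bytes for as long as they are prefixes.
    while code[i] in LEGACY_PREFIXES:
        prefixes.append(code[i])
        i += 1
    # State 2: the first non-prefix byte starts the opcode.
    if code[i] != 0x0F:
        return prefixes, code[i:i + 1]
    # After the 0F escape the next byte is always an opcode byte,
    # even if its value (e.g. F2) would be a prefix in state 1.
    if code[i + 1] in (0x38, 0x3A):
        return prefixes, code[i:i + 3]   # 0F 38 xx / 0F 3A xx
    return prefixes, code[i:i + 2]       # 0F xx
```

For example, `split_prefixes(bytes([0x66, 0x0F, 0xF2, 0xC1]))` (the bytes of `pslld xmm0, xmm1`) treats `66` as a prefix but `f2` as the second opcode byte, because it follows `0f`.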

If you look at an opcode map, there are no entries that are ambiguous between single-byte prefix and opcode. (e.g. http://ref.x86asm.net/coder64.html, and notice how the 2-byte 0F .. opcodes are listed separately).

The decoders do have to know the current mode for this (and other things); for example x86-64 removed the 1-byte inc/dec reg opcodes for use as REX prefixes. (See "x86 32 bit opcodes that differ in x86-x64 or entirely removed".) We can even use this difference to write polyglot machine code that runs differently when decoded in 32-bit vs. 64-bit mode, or even distinguish all 3 mode sizes.
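A small sketch of that mode dependence, in Python. The function name `classify_first_byte` and its `mode_bits` parameter are hypothetical, and only the `40`..`4f` range is modelled; everything else is reported as "other".

```python
def classify_first_byte(byte: int, mode_bits: int) -> str:
    """Classify a byte in the 0x40..0x4F range according to the CPU mode.

    mode_bits is 16, 32 or 64 (an illustrative parameter, not a real API).
    """
    if mode_bits not in (16, 32, 64):
        raise ValueError("mode_bits must be 16, 32 or 64")
    if 0x40 <= byte <= 0x4F:
        if mode_bits == 64:
            return "REX prefix"
        # Outside 64-bit mode: 40..47 are inc reg, 48..4F are dec reg.
        return "inc reg" if byte < 0x48 else "dec reg"
    return "other"
```

With this model the byte sequence `48 ff c0` starts with a REX prefix in 64-bit mode (the whole thing is `inc rax`), while in 32-bit mode the `48` is a complete `dec eax` and `ff c0` is a separate `inc eax`.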

x86 machine code is a byte stream that's not self-synchronizing (e.g. a ModRM or an immediate can be any byte). The CPU always knows where to start decoding from, either a jump target or the byte after the end of a previous instruction. That's the start of the instruction (including prefixes).

Bytes in memory are just bytes, only becoming instructions when they're decoded by the CPU. (Although in normal programs, simply disassembling from the top of the .text section does give you the program's instructions. Self-modifying and obfuscated code are not normal.)

AVX / AVX-512: multi-byte prefixes that overlap with opcodes

Multi-byte VEX and EVEX prefixes aren't that simple in 32-bit mode. For example VEX prefixes overlap with invalid encodings of LES and LDS in modes other than 64-bit. (The c4 and c5 opcodes for LES and LDS are always invalid in 64-bit mode, except as VEX prefixes.) https://wiki.osdev.org/X86-64_Instruction_Encoding#VEX.2FXOP_opcodes

In legacy / compat modes, there weren't any free bytes left that weren't already opcodes or prefixes when AVX (VEX prefixes) and AVX-512 (EVEX prefix) were introduced, so the only room for extensions was as encodings for opcodes that are only valid with a limited set of ModRM bytes. (e.g. LES / LDS require a memory source, not a register. This is why some bits are inverted in VEX prefixes, so the top 2 bits of the byte after c4 or c5 will always be 1 in 32-bit mode instead of 0. That's the "mod" field in ModRM, and 11 means register.)
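The check described above can be sketched in a few lines of Python. The function name `c4_c5_meaning` and the `mode_bits` parameter are invented for illustration, and the sketch assumes protected or compat mode when `mode_bits` is 16 or 32; it does not model real mode or any later validity checks on the rest of the instruction.

```python
def c4_c5_meaning(code: bytes, mode_bits: int) -> str:
    """Decide how a leading C4 or C5 byte is interpreted (simplified)."""
    first = code[0]
    if first not in (0xC4, 0xC5):
        raise ValueError("expected a C4 or C5 byte")
    if mode_bits == 64:
        # LES/LDS do not exist in 64-bit mode, so there is no ambiguity.
        return "VEX prefix"
    # Outside 64-bit mode, look at the top 2 bits of the next byte.
    # As ModRM.mod, 11 would mean a register source, which LES/LDS
    # do not allow, so that pattern is free to mean "VEX".
    if (code[1] & 0xC0) == 0xC0:
        return "VEX prefix"
    return "LES" if first == 0xC4 else "LDS"
```

For instance, `c5 f8 77` (`vzeroupper`) has `f8` after the `c5`, whose top two bits are 11, so it is taken as VEX in every mode this sketch covers, while `c4 00` in 32-bit mode is `les eax, [eax]`.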

(Fun fact: VEX prefixes are not recognized in 16-bit real mode, apparently because some software used the same invalid encodings of LES / LDS as intentional traps, to be sorted out in the #UD exception handler. VEX prefixes are recognized in 16-bit protected mode, though.)

AMD64 freed up several bytes by removing instructions like AAM, as well as LES/LDS (and the one-byte inc/dec reg encodings for use as REX prefixes), but CPU vendors have continued to care about 32-bit mode and not added any extensions that are only available in 64-bit mode which could simply take advantage of those free opcode bytes. This means finding ways to cram new instruction encodings into increasingly small gaps in 32-bit machine code. (Often via mandatory prefixes, e.g. rep bsr = lzcnt on CPUs with that feature, which gives different results.)

So the decoders in modern CPUs that support AVX / BMI1/2 have to look at multiple bytes to decide whether this is a prefix for a valid AVX or other VEX-encoded instruction, or in 32-bit mode if it should decode as LES or LDS. (And I guess look at the rest of the instruction to decide if it should #UD).

But modern CPUs are looking at 16 or 32 bytes at a time anyway to find instruction boundaries in parallel. (And then later feed those groups of instruction bytes to actual decoders, again in parallel.) https://www.realworldtech.com/sandy-bridge/4/

Same goes for the prefix scheme used by AMD XOP, which is a lot like VEX.

Agner Fog's blog article Stop the instruction set war from 2009 (soon after AVX was announced, before the first hardware supporting it) has a table of remaining unused coding space for future extensions, and some notes about it being "assigned" to AMD, Intel, or Via.

Related / examples

How to tell the length of an x86 instruction? (including my answer) has some more details about x86 machine code.
https://codegolf.stackexchange.com/questions/133486/find-an-illegal-string/133622#133622 (on codegolf.SE - the shortest sequence of bytes that will definitely #UD fault if it's not jumped over. It has to be long enough that it can't be consumed by the CPU as the immediate for a mov r64, imm64 for example.)
Why does x/i on gdb give different results then disassemble? - an example of starting decode in the wrong place and decoding the middle of another instruction as something else.

Machine code tricks: decoding the same byte multiple ways

(This is not really related to prefixes, but in general seeing how the rules apply to weird cases can help you understand exactly how things work.)

A software disassembler does need to know a start point. This can be problematic if obfuscated code mixes code and data, and actual execution jumps to places you wouldn't get if you just assume that you can decode in order without following jumps.

Fortunately compiler-generated code doesn't do that so naive static disassembly (e.g. by objdump -d or ndisasm, as opposed to IDA) finds the same instruction boundaries that actually running the program will.

This is not a problem for running obfuscated machine code; the CPU just does what it's told, and never cares about bytes before the place you tell it to jump to. Disassembling without running / single-stepping the program is the hard thing, especially with the possibility of self-modifying code and jumps to what a naive disassembler would think was the middle of an earlier instruction.

Obfuscated machine code can even have an instruction decode one way, then jump back into what was the middle of that instruction, for a later byte to be the opcode (or prefix + opcode). Modern CPUs with uop caches or that mark instruction boundaries in I-cache run slow (but correctly) if you do this, so it's more of a fun code-golf trick (extreme code-size optimization at the expense of speed) or obfuscation technique.

For an example of this, see my codegolf.SE x86 machine code answer to Golf a Custom Fibonacci Sequence. I'll excerpt the disassembly that lines up with what the CPU sees after looping back to cfib.loop, but note that the first iteration decodes differently. So I'm using just 1 byte outside the loop instead of 2 to effectively jump into the middle for the start of the first iteration. See the linked answer for a full description and the other disassembly.

0000000000401070 <cfib>:
  401070:       eb                      .byte 0xeb      # jmp rel8 consuming the 01 add opcode as a rel8
0000000000401071 <cfib.loop>:
  401071:       01 d0                   add    eax,edx
# loop entry point on first iteration, jumping over the ModRM byte (D0) of the ADD
    (entry on first iteration):
  401073:       92                      xchg   edx,eax
  401074:       e2 fb                   loop   401071 <cfib.loop>
  401076:       c3                      ret

You can do this with opcodes that consume more later bytes, like 3D <dword> cmp eax, imm32. When the CPU sees a 3D opcode byte, it will grab the next 4 bytes as the immediate. If you later jump into those 4 bytes, they'll be considered as prefix/opcodes and everything will work (except for performance problems) the same regardless of how those bytes had previously been decoded as a different part of an instruction. The CPU has to maintain the illusion of decoding and executing 1 instruction at a time, other than performance.
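To make the overlap concrete, here is a toy Python walker over the `cfib` bytes from the listing above. `TOY_LENGTHS` and `walk` are names invented for this sketch; the table only knows the handful of opcodes used here, `jmp rel8` is followed, and `loop` is treated as not taken, so this is nowhere near a real decoder.

```python
# Instruction lengths for the few opcodes in these examples (toy table).
TOY_LENGTHS = {
    0xEB: 2,  # jmp rel8
    0x01: 2,  # add r/m32, r32 (register-direct ModRM only)
    0x92: 1,  # xchg edx, eax
    0xE2: 2,  # loop rel8
    0xC3: 1,  # ret
    0x3D: 5,  # cmp eax, imm32
}

def walk(code: bytes, start: int):
    """Return [(offset, hex bytes)] for instructions reached from `start`.

    Follows jmp rel8, falls through loop, stops at ret or end of buffer.
    """
    out = []
    i = start
    while i < len(code):
        op = code[i]
        n = TOY_LENGTHS[op]
        out.append((i, code[i:i + n].hex()))
        if op == 0xC3:
            break
        if op == 0xEB:
            rel = int.from_bytes(code[i + 1:i + 2], "little", signed=True)
            i = i + n + rel          # target is relative to the next instruction
        else:
            i += n
    return out

CFIB = bytes.fromhex("eb01d092e2fbc3")
print(walk(CFIB, 0))  # [(0, 'eb01'), (3, '92'), (4, 'e2fb'), (6, 'c3')]
print(walk(CFIB, 1))  # [(1, '01d0'), (3, '92'), (4, 'e2fb'), (6, 'c3')]
```

The byte at offset 1 (`01`) is a `rel8` when decoding starts at offset 0 and an `add` opcode when decoding starts at offset 1. The same walker shows the `3D` case: `walk(bytes.fromhex("3d929292c3c3"), 0)` swallows four bytes as an immediate, while starting at offset 1 decodes those same bytes as three `xchg` instructions and a `ret`.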

I learned of this trick from @Ira Baxter's answer on Can assembled ASM code result in more than a single possible way (except for offset values)?

121

answered Sep 30 '22 01:09

Peter Cordes

Tags:

cpu-architecture

x86

assembly

machine-code

instruction-encoding