Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to tell length of an x86-64 instruction opcode using CPU itself?

I know that there are libraries that can "parse" binary machine code / opcode to tell the length of an x86-64 CPU instruction.

But I'm wondering, since CPU has internal circuitry to determine this, is there a way to use processor itself to tell the instruction size from a binary code? (Maybe even a hack?)

like image 630
MikeF Avatar asked Jul 26 '18 19:07

MikeF


People also ask

How long is an x86 instruction?

x86 instructions can be anywhere between 1 and 15 bytes long. The length is defined separately for each instruction, depending on the available modes of operation of the instruction, the number of required operands and more.

What size in bytes are opcodes on an x86 processor?

x86 opcodes are 1 byte for most common instructions, especially instructions which have existed since 8086. Instructions added later (e.g. like bsf and movsx in 386) often use 2-byte opcodes with a 0f escape byte.

How big is the x86 instruction set?

On x86-64, instructions can be up to 15 bytes long, making far to many to efficiently count them.

How big is an opcode?

The instruction consists of opcode and operands. Given the instruction set of size 12, 4 bits are required for opcode (2^4 = 16). As there are total 64 registers, 6 bits are required for identifying a register.


1 Answers

The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction.

So if you write a debugger, you can use the CPU's single-stepping capability to find instruction boundaries in a block of code. But only by running it, and if it faults (e.g. a load from an unmapped address) you'll get that exception instead of the TF single-step exception.

(Most OSes have facilities for attaching to and single-stepping another process, e.g. Linux ptrace, so you could maybe create an unprivileged sandbox process where your could step through some unknown bytes of machine code...)

Or as @Rbmn points out, you can use OS-assisted debug facilities to single-step yourself.


@Harold and @MargaretBloom also point out that you can put bytes at the end of a page (followed by an unmapped page) and run them. See if you get a #UD, a page fault, or a #GP exception.

  • #UD: the decoders saw a complete but invalid instruction.
  • page fault on the unmapped page: the decoders hit the unmapped page before deciding that it was an illegal instruction.
  • #GP: the instruction was privileged or faulted for other reasons.

To rule out decoding+running as a complete instruction and then faulting on the unmapped page, start with only 1 byte before the unmapped page, and keep adding more bytes until you stop getting page faults.

Breaking the x86 ISA by Christopher Domas goes into more detail about this technique, including using it to find undocumented illegal instructions, e.g. 9a13065b8000d7 is a 7-byte illegal instruction; that's when it stops page-faulting. (objdump -d just says 0x9a (bad) and decodes the rest of the bytes, but apparently real Intel hardware isn't satisfied that it's bad until it's fetched 6 more bytes).


HW performance counters like instructions_retired.any also expose instruction counts, but without knowing anything about the end of an instruction, you don't know where to put an rdpmc instruction. Padding with 0x90 NOPs and seeing how many instructions total were executed probably wouldn't really work because you'd have to know where to cut and start padding.


I'm wondering, why wouldn't Intel and AMD introduce an instruction for that

For debugging, normally you want to fully disassemble an instruction, not just find insn boundaries. So you need a full software library.

It wouldn't make sense to put a microcoded disassembler behind some new opcode.

Besides, the hardware decoders are only wired up to work as part of the front-end in the code-fetch path, not to feed them arbitrary data. They're already busy decoding instructions most cycles, and aren't wired up to work on data. Adding instructions that decode x86 machine-code bytes would almost certainly be done by replicating that hardware in an ALU execution unit, not by querying the decoded-uop cache or L1i (in designs where instruction boundaries are marked in L1i), or sending data through the actual front-end pre-decoders and capturing the result instead of queuing it for the rest of the front-end.

The only real high-performance use-case I can think of is emulation, or supporting new instructions like Intel's Software Development Emulator (SDE). But if you want to run new instructions on old CPUs, the whole point is that the old CPUs don't know about those new instructions.

The amount of CPU time spend disassembling machine code is pretty tiny compared to the amount of time that CPUs spend doing floating point math, or image processing. There's a reason we have stuff like SIMD FMA and AVX2 vpsadbw in the instruction set to speed up those special-purpose things that CPUs spend a lot of time doing, but not for stuff we can easily do with software.

Remember, the point of an instruction-set is to make it possible to create high-performance code, not to get all meta and specialize in decoding itself.

At the upper end of special-purpose complexity, the SSE4.2 string instructions were introduced in Nehalem. They can do some cool stuff, but are hard to use. https://www.strchr.com/strcmp_and_strlen_using_sse_4.2 (also includes strstr, which is a real use-case where pcmpistri can be faster than SSE2 or AVX2, unlike for strlen / strcmp where plain old pcmpeqb / pminub works very well if used efficiently (see glibc's hand-written asm).) Anyway, these new instructions are still multi-uop even in Skylake, and aren't widely used. I think compilers have a hard time autovectorizing with them, and most string-processing is done in languages where it's not so easy to tightly integrate a few intrinsics with low overhead.


installing a trampoline (for hotpatching a binary function.)

Even this requires decoding the instructions, not just finding their length.

If the first few instruction bytes of a function used a RIP-relative addressing mode (or a jcc rel8/rel32, or even a jmp or call), moving it elsewhere will break the code. (Thanks to @Rbmn for pointing out this corner case.)

like image 96
Peter Cordes Avatar answered Sep 24 '22 03:09

Peter Cordes