I know that there are libraries that can "parse" binary machine code / opcodes to tell the length of an x86-64 CPU instruction. But I'm wondering, since the CPU has internal circuitry to determine this, is there a way to use the processor itself to tell the instruction size from the binary code? (Maybe even a hack?)


How to tell length of an x86-64 instruction opcode using CPU itself?

1 Answer

The Trap Flag (TF) in EFLAGS/RFLAGS makes the CPU single-step, i.e. take an exception after running one instruction.

So if you write a debugger, you can use the CPU's single-stepping capability to find instruction boundaries in a block of code. But only by running it, and if it faults (e.g. a load from an unmapped address) you'll get that exception instead of the TF single-step exception.

(Most OSes have facilities for attaching to and single-stepping another process, e.g. Linux ptrace, so you could maybe create an unprivileged sandbox process where you could step through some unknown bytes of machine code...)

Or as @Rbmn points out, you can use OS-assisted debug facilities to single-step yourself.

@Harold and @MargaretBloom also point out that you can put bytes at the end of a page (followed by an unmapped page) and run them. See if you get a #UD, a page fault, or a #GP exception.

#UD: the decoders saw a complete but invalid instruction.
page fault on the unmapped page: the decoders hit the unmapped page before deciding that it was an illegal instruction.
#GP: the instruction was privileged or faulted for other reasons.

To rule out decoding+running as a complete instruction and then faulting on the unmapped page, start with only 1 byte before the unmapped page, and keep adding more bytes until you stop getting page faults.

Breaking the x86 ISA by Christopher Domas goes into more detail about this technique, including using it to find undocumented illegal instructions, e.g. 9a13065b8000d7 is a 7-byte illegal instruction; that's when it stops page-faulting. (objdump -d just says 0x9a (bad) and decodes the rest of the bytes, but apparently real Intel hardware isn't satisfied that it's bad until it's fetched 6 more bytes).

HW performance counters like inst_retired.any also expose instruction counts, but without knowing where an instruction ends, you don't know where to put an rdpmc instruction. Padding with 0x90 NOPs and counting how many instructions executed in total probably wouldn't work either, because you'd have to know where to cut and start padding.

> I'm wondering, why wouldn't Intel and AMD introduce an instruction for that

For debugging, normally you want to fully disassemble an instruction, not just find insn boundaries. So you need a full software library.

It wouldn't make sense to put a microcoded disassembler behind some new opcode.

Besides, the hardware decoders are wired up only as part of the front-end's code-fetch path, and they're already busy decoding instructions most cycles; there's no path for feeding them arbitrary data. An instruction that decodes x86 machine-code bytes would almost certainly be implemented by replicating that hardware in an ALU execution unit, not by querying the decoded-uop cache or L1i (in designs where instruction boundaries are marked in L1i), and not by sending data through the actual front-end pre-decoders and capturing the result instead of queuing it for the rest of the front-end.

The only real high-performance use-case I can think of is emulation, or supporting new instructions in software the way Intel's Software Development Emulator (SDE) does. But if you want to run new instructions on old CPUs, the whole point is that the old CPUs don't know about those new instructions.

The amount of CPU time spent disassembling machine code is pretty tiny compared to the amount of time that CPUs spend doing floating point math, or image processing. There's a reason we have stuff like SIMD FMA and AVX2 vpsadbw in the instruction set to speed up those special-purpose things that CPUs spend a lot of time doing, but not for stuff we can easily do with software.

Remember, the point of an instruction-set is to make it possible to create high-performance code, not to get all meta and specialize in decoding itself.

At the upper end of special-purpose complexity, the SSE4.2 string instructions were introduced in Nehalem. They can do some cool stuff, but are hard to use. See https://www.strchr.com/strcmp_and_strlen_using_sse_4.2, which also covers strstr, a real use-case where pcmpistri can be faster than SSE2 or AVX2. For strlen / strcmp, on the other hand, plain old pcmpeqb / pminub works very well if used efficiently (see glibc's hand-written asm). Anyway, these instructions are still multi-uop even in Skylake, and aren't widely used. I think compilers have a hard time autovectorizing with them, and most string-processing is done in languages where it's not so easy to tightly integrate a few intrinsics with low overhead.

> installing a trampoline (for hotpatching a binary function.)

Even this requires decoding the instructions, not just finding their length.

If the first few instructions of a function use a RIP-relative addressing mode (or are a jcc rel8/rel32, or even a jmp or call), moving them elsewhere will break the code, because the encoded displacement is relative to the instruction's own address. (Thanks to @Rbmn for pointing out this corner case.)

answered Sep 24 '22 03:09

Peter Cordes


Tags:

cpu-architecture

x86

x86-64

opcode

micro-architecture

MikeF
