
With variable length instructions how does the computer know the length of the instruction being fetched? [duplicate]

In architectures where not all instructions are the same length, how does the computer know how much to read for one instruction? For example, in Intel IA-32 some instructions are 4 bytes and some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Is it that the first instruction read when the machine is powered on has a known size and each instruction contains the size of the next one?

Celeritas asked Jun 17 '14 16:06


1 Answer

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode and even with respect to the width of later parts of the pipeline. Fetching more bytes than average can reduce the impact of variability in instruction length and of the effective fetch bandwidth loss associated with taken control flow instructions.
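This decoupling of fixed-width fetch from variable-length decode can be sketched as follows. Everything here is hypothetical illustration: `FETCH_WIDTH` is an arbitrary choice, and the list of instruction lengths stands in for a real length decoder, which would derive lengths from the opcode bytes.

```python
# Sketch: a fixed-width fetch fills a byte buffer from which variable-length
# instructions are consumed. The `lengths` list is a stand-in for a real
# length decoder; FETCH_WIDTH is an arbitrary illustrative value.

FETCH_WIDTH = 16  # bytes fetched per cycle, regardless of instruction length

def decode_stream(code, lengths):
    buffer = bytearray()
    fetched = 0          # bytes consumed from memory so far
    decoded = []
    pending = iter(lengths)
    need = next(pending, None)
    while need is not None:
        # Refill the buffer until it holds at least one full instruction;
        # leftover bytes simply stay buffered for the next decode group.
        while len(buffer) < need and fetched < len(code):
            buffer += code[fetched:fetched + FETCH_WIDTH]
            fetched += FETCH_WIDTH
        if len(buffer) < need:
            break  # ran out of code bytes
        decoded.append(bytes(buffer[:need]))
        del buffer[:need]
        need = next(pending, None)
    return decoded

insns = decode_stream(bytes(range(10)), [1, 4, 2, 3])
```

Note that the second instruction's bytes may straddle two fetch chunks; the buffer is what hides that seam from the decoder.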

(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch and reduce effective fetch bandwidth with targets that are less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high performance x86—a taken branch that targets the 16th [last] byte in a chunk will result in effectively only one byte of code being fetched as the other 15 bytes are discarded.)
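The arithmetic behind that example is simple enough to state directly (the 16-byte chunk size and the addresses below are illustrative, not drawn from any particular design):

```python
# Effective bytes obtained when a taken branch lands at `target` inside a
# CHUNK-byte aligned fetch group: bytes before the target are discarded.
CHUNK = 16

def effective_fetch(target):
    return CHUNK - (target % CHUNK)

effective_fetch(0x1230)  # chunk-aligned target: all 16 bytes useful
effective_fetch(0x123F)  # target is the last byte: only 1 useful byte
```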

Even for fixed length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they were not aligned, as long as the group of instructions did not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow fetch to cross a cache block—but not page—boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (For POWER4, 32-byte chunks were used containing 8 instructions, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles where no fetch is performed and to give spare Icache cycles for cache block filling after a miss while only having one read/write port to the Icache.)

For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel or wait for the length to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb2, which only have two lengths determinable by the major opcode and for which the encoding of different length instructions is substantially different, using non-speculative decode may be preferred, especially given the likely narrow decode and emphasis on energy-efficiency, though with only two lengths some speculation may be reasonable at small decode width.
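The two "easy" length encodings mentioned above can be sketched directly. In z/Architecture the two most significant bits of the first 16-bit parcel give the length (this is the documented encoding); in RISC-V the low two bits of the first byte distinguish 16-bit compressed from 32-bit instructions (the reserved longer 48/64-bit encodings are omitted here for simplicity):

```python
def zarch_length(first_halfword):
    # z/Architecture: the top two bits of the first 16-bit parcel encode
    # the instruction length: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes.
    top2 = (first_halfword >> 14) & 0b11
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[top2]

def riscv_length(first_byte):
    # RISC-V with the C extension: if the low two bits are not 0b11, the
    # instruction is a 16-bit compressed one; otherwise it is 32 bits.
    # (Reserved 48-/64-bit encodings are ignored in this sketch.)
    return 2 if (first_byte & 0b11) != 0b11 else 4
```

Because the length falls out of a couple of bits before any real decoding happens, a serial "find the boundaries, then decode in parallel" pipeline stays cheap for these ISAs; x86 offers no such shortcut.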

For x86, one strategy used by AMD to avoid excessive decode energy use is to use marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded) and it still requires the decoders to check that the lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
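A rough model of such end-of-instruction marker bits is sketched below. The `length_of` callback stands in for a real x86 length decoder (which is far more involved); the point is only that once the marks exist, splitting the byte stream into instructions is trivial:

```python
# Hypothetical predecode sketch: on an Icache fill, mark the last byte of
# each instruction so later fetches can split the stream without
# re-deriving lengths. `length_of(chunk, i)` is a stand-in length decoder.

def predecode(chunk, length_of):
    """Return one end-marker bit per byte of `chunk` (the slow fill path)."""
    marks = [0] * len(chunk)
    i = 0
    while i < len(chunk):
        i += length_of(chunk, i)
        if i - 1 < len(marks):
            marks[i - 1] = 1
    return marks

def split_by_marks(chunk, marks):
    """Fast fetch path: parse instructions using only the marker bits."""
    out, start = [], 0
    for i, m in enumerate(marks):
        if m:
            out.append(chunk[start:i + 1])
            start = i + 1
    return out
```

The cost distribution matches the text: `predecode` runs once per cache fill (adding miss latency), while `split_by_marks` is the cheap per-fetch operation; a jump into the middle of what was marked as an instruction would find no valid mark pattern and force re-predecoding.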

Intel seems to prefer the speculative parallel decode approach. Since the length of a previous instruction in a chunk to be decoded will be available after only modest delay, the second and later decoders may not need to fully decode the instruction for all starting points.
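The speculative approach can be caricatured as follows: decode tentatively at every byte offset in parallel, then keep only the results whose offsets turn out to be real instruction starts. In hardware the dictionary comprehension below would be one tentative decoder per offset; `length_of` is again a hypothetical stand-in:

```python
# Speculative parallel decode sketch: tentatively determine a length at
# every byte offset (parallel in hardware), then serially select the
# offsets that are genuine instruction boundaries and discard the rest.

def speculative_decode(chunk, length_of):
    tentative = {off: length_of(chunk, off) for off in range(len(chunk))}
    kept, off = [], 0
    while off < len(chunk):
        n = tentative[off]
        kept.append((off, n))  # (start offset, length) of a real instruction
        off += n
    return kept
```

Most of the tentative work is thrown away, which is exactly the energy cost the marker-bit approach avoids; the selection chain, however, only needs each predecessor's length, not its full decode, which is why later decoders need not handle every starting point.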

Since x86 instructions can be relatively complex, there are also often decode template constraints and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded to producing only one µop while the first instruction can decode into up to four µops (with longer µop sequences using a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.

In more recent performance-oriented x86 designs, Intel has used a µop cache which stores instructions in decoded format avoiding template and fetch width constraints and reducing energy use associated with decoding.

Paul A. Clayton answered Oct 12 '22 22:10