
With variable length instructions how does the computer know the length of the instruction being fetched? [duplicate]

In architectures where not all instructions are the same length, how does the computer know how much to read for one instruction? For example, in Intel IA-32 some instructions are 4 bytes and some are 8 bytes, so how does it know whether to read 4 or 8 bytes? Is it that the first instruction read when the machine is powered on has a known size and each instruction contains the size of the next one?

Celeritas asked Jun 17 '14 16:06


1 Answer

First, the processor does not need to know how many bytes to fetch; it can fetch a convenient number of bytes sufficient to provide the targeted throughput for typical or average instruction lengths. Any extra bytes can be placed in a buffer to be used in the next group of bytes to be decoded. There are tradeoffs in the width and alignment of fetch relative to the supported width of instruction decode and even with respect to the width of later parts of the pipeline. Fetching more bytes than average can reduce the impact of variability in instruction length and of the effective fetch bandwidth loss associated with taken control flow instructions.
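This decoupling of fixed-width fetch from variable-length decode can be sketched as follows. Everything here is hypothetical illustration: `FETCH_WIDTH` is an arbitrary choice, and the list of instruction lengths stands in for a real length decoder, which would derive lengths from the opcode bytes.

```python
# Sketch: a fixed-width fetch fills a byte buffer from which variable-length
# instructions are consumed. The `lengths` list is a stand-in for a real
# length decoder; FETCH_WIDTH is an arbitrary illustrative value.

FETCH_WIDTH = 16  # bytes fetched per cycle, regardless of instruction length

def decode_stream(code, lengths):
    buffer = bytearray()
    fetched = 0          # bytes consumed from memory so far
    decoded = []
    pending = iter(lengths)
    need = next(pending, None)
    while need is not None:
        # Refill the buffer until it holds at least one full instruction;
        # leftover bytes simply stay buffered for the next decode group.
        while len(buffer) < need and fetched < len(code):
            buffer += code[fetched:fetched + FETCH_WIDTH]
            fetched += FETCH_WIDTH
        if len(buffer) < need:
            break  # ran out of code bytes
        decoded.append(bytes(buffer[:need]))
        del buffer[:need]
        need = next(pending, None)
    return decoded

insns = decode_stream(bytes(range(10)), [1, 4, 2, 3])
```

Note that the second instruction's bytes may straddle two fetch chunks; the buffer is what hides that seam from the decoder.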

(Taken control flow instructions may introduce a fetch bubble if the [predicted] target is not available until a cycle after the next fetch and reduce effective fetch bandwidth with targets that are less aligned than the instruction fetch. E.g., if instruction fetch is 16-byte aligned—as is common for high performance x86—a taken branch that targets the 16th [last] byte in a chunk will result in effectively only one byte of code being fetched as the other 15 bytes are discarded.)
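The arithmetic behind that example is simple enough to state directly (the 16-byte chunk size and the addresses below are illustrative, not drawn from any particular design):

```python
# Effective bytes obtained when a taken branch lands at `target` inside a
# CHUNK-byte aligned fetch group: bytes before the target are discarded.
CHUNK = 16

def effective_fetch(target):
    return CHUNK - (target % CHUNK)

effective_fetch(0x1230)  # chunk-aligned target: all 16 bytes useful
effective_fetch(0x123F)  # target is the last byte: only 1 useful byte
```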

Even for fixed length instructions, fetching multiple instructions per cycle introduces similar issues. Some implementations (e.g., MIPS R10000) would fetch as many instructions as could be decoded even if they were not aligned, as long as the group of instructions did not cross a cache line boundary. (I seem to recall that one RISC implementation used two banks of Icache tags to allow fetch to cross a cache block—but not page—boundary.) Other implementations (e.g., POWER4) would fetch aligned chunks of code even for a branch targeting the last instruction in such a chunk. (For POWER4, 32-byte chunks were used containing 8 instructions, but at most five instructions could pass decode per cycle. This excess fetch width could be exploited to save energy via cycles where no fetch is performed and to give spare Icache cycles for cache block filling after a miss while only having one read/write port to the Icache.)

For decoding multiple instructions per cycle, there are effectively two strategies: speculatively decode in parallel or wait for the length to be determined and use that information to parse the instruction stream into separate instructions. For an ISA like IBM's zArchitecture (S/360 descendant), the length in 16-bit parcels is trivially determined by two bits in the first parcel, so waiting for the lengths to be determined makes more sense. (RISC V's slightly more complex length indication mechanism would still be friendly to non-speculative decode.) For an encoding like that of microMIPS or Thumb2, which only have two lengths determinable by the major opcode and for which the encoding of different length instructions is substantially different, using non-speculative decode may be preferred, especially given the likely narrow decode and emphasis on energy-efficiency, though with only two lengths some speculation may be reasonable at small decode width.
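The two "easy" length encodings mentioned above can be sketched directly. In z/Architecture the two most significant bits of the first 16-bit parcel give the length (this is the documented encoding); in RISC-V the low two bits of the first byte distinguish 16-bit compressed from 32-bit instructions (the reserved longer 48/64-bit encodings are omitted here for simplicity):

```python
def zarch_length(first_halfword):
    # z/Architecture: the top two bits of the first 16-bit parcel encode
    # the instruction length: 00 -> 2 bytes, 01/10 -> 4 bytes, 11 -> 6 bytes.
    top2 = (first_halfword >> 14) & 0b11
    return {0b00: 2, 0b01: 4, 0b10: 4, 0b11: 6}[top2]

def riscv_length(first_byte):
    # RISC-V with the C extension: if the low two bits are not 0b11, the
    # instruction is a 16-bit compressed one; otherwise it is 32 bits.
    # (Reserved 48-/64-bit encodings are ignored in this sketch.)
    return 2 if (first_byte & 0b11) != 0b11 else 4
```

Because the length falls out of a couple of bits before any real decoding happens, a serial "find the boundaries, then decode in parallel" pipeline stays cheap for these ISAs; x86 offers no such shortcut.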

For x86, one strategy used by AMD to avoid excessive decode energy use is to use marker bits in the instruction cache indicating which byte ends an instruction. With such marker bits, it is simple to find the start of each instruction. This technique has the disadvantage that it adds to the latency of an instruction cache miss (the instructions must be predecoded) and it still requires the decoders to check that the lengths are correct (e.g., in case a jump is made into what was previously the middle of an instruction).
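A rough model of such end-of-instruction marker bits is sketched below. The `length_of` callback stands in for a real x86 length decoder (which is far more involved); the point is only that once the marks exist, splitting the byte stream into instructions is trivial:

```python
# Hypothetical predecode sketch: on an Icache fill, mark the last byte of
# each instruction so later fetches can split the stream without
# re-deriving lengths. `length_of(chunk, i)` is a stand-in length decoder.

def predecode(chunk, length_of):
    """Return one end-marker bit per byte of `chunk` (the slow fill path)."""
    marks = [0] * len(chunk)
    i = 0
    while i < len(chunk):
        i += length_of(chunk, i)
        if i - 1 < len(marks):
            marks[i - 1] = 1
    return marks

def split_by_marks(chunk, marks):
    """Fast fetch path: parse instructions using only the marker bits."""
    out, start = [], 0
    for i, m in enumerate(marks):
        if m:
            out.append(chunk[start:i + 1])
            start = i + 1
    return out
```

The cost distribution matches the text: `predecode` runs once per cache fill (adding miss latency), while `split_by_marks` is the cheap per-fetch operation; a jump into the middle of what was marked as an instruction would find no valid mark pattern and force re-predecoding.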

Intel seems to prefer the speculative parallel decode approach. Since the length of a previous instruction in a chunk to be decoded will be available after only modest delay, the second and later decoders may not need to fully decode the instruction for all starting points.
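The speculative approach can be caricatured as follows: decode tentatively at every byte offset in parallel, then keep only the results whose offsets turn out to be real instruction starts. In hardware the dictionary comprehension below would be one tentative decoder per offset; `length_of` is again a hypothetical stand-in:

```python
# Speculative parallel decode sketch: tentatively determine a length at
# every byte offset (parallel in hardware), then serially select the
# offsets that are genuine instruction boundaries and discard the rest.

def speculative_decode(chunk, length_of):
    tentative = {off: length_of(chunk, off) for off in range(len(chunk))}
    kept, off = [], 0
    while off < len(chunk):
        n = tentative[off]
        kept.append((off, n))  # (start offset, length) of a real instruction
        off += n
    return kept
```

Most of the tentative work is thrown away, which is exactly the energy cost the marker-bit approach avoids; the selection chain, however, only needs each predecessor's length, not its full decode, which is why later decoders need not handle every starting point.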

Since x86 instructions can be relatively complex, there are also often decode template constraints and at least one earlier design restricted the number of prefixes that could be used while maintaining full decode bandwidth. E.g., Haswell limits the second through fourth instructions decoded to producing only one µop while the first instruction can decode into up to four µops (with longer µop sequences using a microcode engine). Basically, this is an optimization for the common case (relatively simple instructions) at the expense of the less common case.

In more recent performance-oriented x86 designs, Intel has used a µop cache which stores instructions in decoded format avoiding template and fetch width constraints and reducing energy use associated with decoding.

Paul A. Clayton answered Oct 12 '22 22:10