
How does the CPU/assembler know the size of the next instruction?

For the sake of example, imagine I am building a virtual machine. I have a byte array and a while loop. How do I know how many bytes to read from the byte array for the next instruction when interpreting an Intel-8086-like instruction set?

EDIT: (commented)

The CPU reads the opcode at the instruction pointer. On the 8086 and other CISC designs there are one-byte and two-byte instructions. How do I know whether the next instruction is F or FF?

EDIT:

Found an answer myself in this piece of text on http://www.swansontec.com/sintel.html

The operation code, or opcode, comes after any optional prefixes. The opcode tells the processor which instruction to execute. In addition, opcodes contain bit fields describing the size and type of operands to expect. The NOT instruction, for example, has the opcode 1111011w. In this opcode, the w bit determines whether the operand is a byte or a word. The OR instruction has the opcode 000010dw. In this opcode, the d bit determines which operands are the source and destination, and the w bit determines the size again. Some instructions have several different opcodes. For example, when OR is used with the accumulator register (AX or EAX) and a constant, it has the special space-saving opcode 0000110w, which eliminates the need for a separate ModR/M byte. From a size-coding perspective, memorizing exact opcode bits is not necessary. Having a general idea of what type of opcodes are available for a particular instruction is more important.
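To make the bit fields in that quote concrete, here is a minimal Python sketch that matches the three opcode patterns it mentions and pulls out the w and d bits. The function name describe_opcode and the returned dictionary layout are made up for illustration; only the bit patterns come from the quoted text.

```python
def describe_opcode(opcode: int) -> dict:
    """Match the three opcode patterns from the quote and extract their bit fields."""
    w = opcode & 0b1         # bit 0: 0 = byte operand, 1 = word operand
    d = (opcode >> 1) & 0b1  # bit 1: direction, only meaningful for 000010dw

    if opcode & 0b11111110 == 0b11110110:
        # 1111011w: the opcode used by NOT, followed by a ModR/M byte
        return {"pattern": "1111011w", "w": w, "has_modrm": True}
    if opcode & 0b11111100 == 0b00001000:
        # 000010dw: OR between a register and a register/memory operand
        return {"pattern": "000010dw", "d": d, "w": w, "has_modrm": True}
    if opcode & 0b11111110 == 0b00001100:
        # 0000110w: OR accumulator with a constant, no ModR/M byte
        return {"pattern": "0000110w", "w": w, "has_modrm": False}
    raise ValueError(f"opcode {opcode:#04x} is not covered by this sketch")


print(describe_opcode(0xF7))  # NOT with a word operand
print(describe_opcode(0x0B))  # OR with d = 1, w = 1
print(describe_opcode(0x0C))  # OR AL with a byte constant
```

A decoder for a real instruction set would need a table covering every opcode, but the idea stays the same: mask off the variable bits, match the fixed ones, and let the result decide which bytes follow.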

Ashley Meah asked Aug 03 '14 05:08



2 Answers

The CPU simply decodes the instruction. In the case of the 8086, the first byte tells the processor how many more bytes to fetch. The full length does not have to be known from the first byte alone: the first byte only has to indicate that more bytes are needed, and those bytes can in turn indicate that still more are needed. With instruction sets built from 8-bit units, like the x86 family, where you start with one byte and then see how many more you need, and where instructions are unaligned, you have to treat the instruction stream as a byte stream in order to decode it.

You should write yourself a very simple instruction set simulator with only a handful of instructions, maybe enough to load a register, add something to it, and then loop. It is extremely educational for what you are trying to understand, and takes maybe half an hour to write, if that.
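As a rough idea of what such a simulator could look like, here is a minimal Python sketch. The instruction set is entirely made up for illustration (it is not 8086 encoding): the opcodes 0x01, 0x02, 0x03, 0x04 and 0xFF, the LENGTHS table, and the run function are all hypothetical. The point is only that the first byte determines how many bytes the loop consumes.

```python
# Hypothetical toy instruction set; the first byte fixes the total length.
#   0x01 reg imm8   LOAD reg, imm8     (3 bytes)
#   0x02 reg imm8   ADD  reg, imm8     (3 bytes)
#   0x03 reg        DEC  reg           (2 bytes)
#   0x04 reg addr   JNZ  reg, addr     (3 bytes, absolute target)
#   0xFF            HALT               (1 byte)
LENGTHS = {0x01: 3, 0x02: 3, 0x03: 2, 0x04: 3, 0xFF: 1}


def run(program: bytes, max_steps: int = 1000) -> list:
    regs = [0, 0, 0, 0]
    ip = 0
    for _ in range(max_steps):
        opcode = program[ip]
        length = LENGTHS[opcode]               # the opcode says how much to read
        operands = program[ip + 1:ip + length]
        ip += length                           # default: fall through to the next instruction
        if opcode == 0x01:
            regs[operands[0]] = operands[1]
        elif opcode == 0x02:
            regs[operands[0]] = (regs[operands[0]] + operands[1]) & 0xFF
        elif opcode == 0x03:
            regs[operands[0]] = (regs[operands[0]] - 1) & 0xFF
        elif opcode == 0x04:
            if regs[operands[0]] != 0:
                ip = operands[1]               # taken jump overrides the fall-through
        elif opcode == 0xFF:
            return regs
    raise RuntimeError("step limit reached without HALT")


PROGRAM = bytes([
    0x01, 0x00, 0x00,  # offset 0:  LOAD r0, 0
    0x01, 0x01, 0x03,  # offset 3:  LOAD r1, 3
    0x02, 0x00, 0x05,  # offset 6:  ADD  r0, 5
    0x03, 0x01,        # offset 9:  DEC  r1
    0x04, 0x01, 0x06,  # offset 11: JNZ  r1, 6
    0xFF,              # offset 14: HALT
])

print(run(PROGRAM))  # [15, 0, 0, 0]
```

The loop body runs three times, so r0 ends at 15. Real x86 decoding is messier because the length can depend on prefixes and on bytes after the opcode, but the fetch, decode, advance structure is the same.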

old_timer answered Nov 15 '22 12:11


TLDR:

The solution is more complex than a fixed size array.


It's all about context, which is why disassemblers like IDA use complex algorithms to do this.

Instructions are variable length for x86. But if you know the start of an instruction, you know where THAT INSTRUCTION ends. Because of that, you MAY know where the next one begins. I will explain the exceptions soon. But first, here's an example:

ASM:
mov eax, 0
xor eax, eax

Machine:
b8 00 00 00 00
31 c0

Explanation:

Moving an immediate into eax is opcode B8, followed by the 32-bit (4-byte) value to move into eax (as eax is 32 bits wide). In other words, in 32-bit code this encoding of mov eax, immediate is always 5 bytes. So if you know you are starting on an instruction (not always a safe assumption) and the byte is B8, you know it is a 5-byte instruction, and that the next instruction SHOULD start 5 bytes later.

Note that both instructions (mov eax, 0 and xor eax, eax) effectively do the same thing: they set eax to 0.
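The same reasoning can be sketched in Python for just the opcodes in this example. This is a minimal sketch that assumes 32-bit code with no prefixes; the function names instruction_length and split_instructions are made up for illustration, and anything outside the tiny subset raises an error rather than guessing.

```python
def instruction_length(code: bytes, offset: int) -> int:
    """Length in bytes of the instruction at offset (tiny 32-bit x86 subset only)."""
    opcode = code[offset]
    if 0xB8 <= opcode <= 0xBF:
        # mov r32, imm32: one opcode byte plus a 4-byte immediate
        return 5
    if opcode == 0x90:
        # nop: a single byte
        return 1
    if opcode == 0x31:
        # xor r/m32, r32: opcode followed by a ModR/M byte
        modrm = code[offset + 1]
        if modrm >> 6 == 0b11:
            # mod = 11 means register-to-register, so nothing follows the ModR/M byte
            return 2
        raise NotImplementedError("memory operands are not handled in this sketch")
    raise NotImplementedError(f"opcode {opcode:#04x} is not handled in this sketch")


def split_instructions(code: bytes) -> list:
    """Walk the byte stream from offset 0, cutting it at instruction boundaries."""
    chunks = []
    offset = 0
    while offset < len(code):
        length = instruction_length(code, offset)
        chunks.append(code[offset:offset + length])
        offset += length
    return chunks


code = bytes.fromhex("b8 00 00 00 00 31 c0")
print([chunk.hex() for chunk in split_instructions(code)])  # ['b800000000', '31c0']
```

Note that for 31 the length is not known from the opcode alone: the ModR/M byte has to be inspected as well, which is the "first byte tells you to look at more bytes" behaviour.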

Exception:

Things can get tricky with jumps/calls. It is possible to jump to an address that is in the "middle of an instruction" and still execute valid code from there.

Let's look at:

mov eax, 0x90909090

machine code:

b8 90 90 90 90

If we later had a jmp instruction that jumped to the address of the 3rd byte of the above instruction, the CPU would just execute 3 NOPs (no operation) and fall through to the next instruction after it, without setting eax to 0x90909090. This is because a NOP is a 1-byte instruction consisting of the single byte 0x90.
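Here is a minimal Python sketch of that effect: the same five bytes decoded from two different starting offsets. The disassemble function is made up for illustration and only understands the two opcodes in this example (B8 and 90).

```python
def disassemble(code: bytes, start: int) -> list:
    """Decode from the given start offset; only B8 (mov eax, imm32) and 90 (nop) are known."""
    listing = []
    offset = start
    while offset < len(code):
        opcode = code[offset]
        if opcode == 0xB8:
            # x86 immediates are stored little-endian
            imm = int.from_bytes(code[offset + 1:offset + 5], "little")
            listing.append((offset, f"mov eax, {imm:#x}"))
            offset += 5
        elif opcode == 0x90:
            listing.append((offset, "nop"))
            offset += 1
        else:
            raise NotImplementedError(f"opcode {opcode:#04x} is not handled in this sketch")
    return listing


code = bytes.fromhex("b8 90 90 90 90")
print(disassemble(code, 0))  # [(0, 'mov eax, 0x90909090')]
print(disassemble(code, 2))  # [(2, 'nop'), (3, 'nop'), (4, 'nop')]
```

Offset 2 is the 3rd byte of the instruction. Starting there, the decoder never sees the B8, so the remaining bytes are read as three separate one-byte instructions.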

XlogicX answered Nov 15 '22 12:11