For the sake of example, imagine I am building a virtual machine. I have a byte array and a while loop. How do I know how many bytes to read from the byte array to interpret the next Intel-8086-like instruction?
The CPU reads the opcode at the instruction pointer, and a CISC architecture like the 8086 has both one-byte and two-byte instructions. How do I know whether the next instruction is F or FF?
I found an answer myself in this passage from http://www.swansontec.com/sintel.html:
The operation code, or opcode, comes after any optional prefixes. The opcode tells the processor which instruction to execute. In addition, opcodes contain bit fields describing the size and type of operands to expect. The NOT instruction, for example, has the opcode 1111011w. In this opcode, the w bit determines whether the operand is a byte or a word. The OR instruction has the opcode 000010dw. In this opcode, the d bit determines which operands are the source and destination, and the w bit determines the size again. Some instructions have several different opcodes. For example, when OR is used with the accumulator register (AX or EAX) and a constant, it has the special space-saving opcode 0000110w, which eliminates the need for a separate ModR/M byte. From a size-coding perspective, memorizing exact opcode bits is not necessary. Having a general idea of what type of opcodes are available for a particular instruction is more important.
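As a small illustration of the bit fields described in the quote, here is a minimal Python sketch that pulls the d and w bits out of an opcode byte of the form 000010dw (the general OR encoding mentioned above). The function name split_or_opcode is made up for this example; it is not part of any library.

```python
def split_or_opcode(opcode):
    """Split an opcode byte of the form 000010dw into its (d, w) bits.

    Returns None when the upper six bits are not 000010.
    """
    if opcode >> 2 != 0b000010:
        return None
    d = (opcode >> 1) & 1  # direction bit: which operand is the destination
    w = opcode & 1         # width bit: 0 = byte operand, 1 = word operand
    return d, w


if __name__ == "__main__":
    for op in (0x08, 0x09, 0x0A, 0x0B, 0x0C):
        print(f"{op:#04x} -> {split_or_opcode(op)}")
```

The last byte tried, 0x0C, matches the accumulator form 0000110w instead, so this helper rejects it; a real decoder would check that pattern separately.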
Find the Intel® processor number. Visit the product specification page and enter the processor number in the search box. Look for Instruction Set Extensions under the Advanced Technologies tab.
The assembly instructions are assembled (turned into their binary equivalent of 0s and 1s, referred to from now on as logic signals). These logic signals are in turn interpreted by the CPU and turned into lower-level logic signals, which direct the CPU to execute the particular instruction.
Instruction set size – It is the total number of instructions defined in the processor. Opcode size – It is the number of bits occupied by the opcode, calculated by taking the base-2 logarithm of the instruction set size and rounding up. Operand size – It is the number of bits occupied by the operand.
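The opcode-size rule above can be checked with a few lines of Python. This is only a sketch for an instruction set with a fixed-width opcode field; it does not describe variable-length encodings such as x86. The function name opcode_bits is an assumption made for the example.

```python
def opcode_bits(instruction_count):
    """Bits needed for a fixed-width opcode field that must distinguish
    instruction_count different instructions, i.e. ceil(log2(n))."""
    if instruction_count < 1:
        raise ValueError("instruction_count must be at least 1")
    # (n - 1).bit_length() equals ceil(log2(n)) and avoids float rounding.
    return (instruction_count - 1).bit_length()


if __name__ == "__main__":
    for n in (2, 100, 128, 129):
        print(n, "instructions need", opcode_bits(n), "opcode bits")
```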
Yes, an instruction can be composed from many bytes. The average length is often less than 4 in integer code, but can be longer in SSE or especially AVX512 code (new instructions have longer encodings).
The CPU simply decodes the instruction. In the case of the 8086, the first byte tells the processor how much more to fetch. The full length does not have to come from the first byte alone: the first byte has to indicate in some way that more bytes are needed, and those bytes can in turn indicate that even more are needed. With byte-oriented instruction sets like the x86 family, where you start with one byte and then see how much more you need, and where instructions are also unaligned, you have to treat the instruction stream as a byte stream in order to decode it.
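To make the "first byte asks for more, and that byte may ask for even more" idea concrete, here is a rough Python sketch for a tiny subset of 16-bit 8086 opcodes. It assumes no prefixes and covers only the handful of opcodes listed in the two tables; the names MODRM_OPCODES, FIXED_LENGTH, displacement_size and instruction_length are invented for this example.

```python
# Opcodes followed by a ModR/M byte and no immediate (OR and MOV forms).
MODRM_OPCODES = {0x08, 0x09, 0x0A, 0x0B, 0x88, 0x89, 0x8A, 0x8B}
# Opcodes whose total length is known from the first byte alone:
# NOP, OR AL,imm8 and OR AX,imm16.
FIXED_LENGTH = {0x90: 1, 0x0C: 2, 0x0D: 3}


def displacement_size(modrm):
    """Number of displacement bytes implied by a 16-bit ModR/M byte."""
    mod = modrm >> 6
    rm = modrm & 0b111
    if mod == 0b00:
        return 2 if rm == 0b110 else 0  # rm=110 means a 16-bit direct address
    if mod == 0b01:
        return 1                        # 8-bit displacement
    if mod == 0b10:
        return 2                        # 16-bit displacement
    return 0                            # mod=11: register operand


def instruction_length(code, ip):
    """Length in bytes of the instruction starting at code[ip]."""
    opcode = code[ip]
    if opcode in FIXED_LENGTH:
        return FIXED_LENGTH[opcode]
    if opcode in MODRM_OPCODES:
        # The opcode says "read a ModR/M byte"; the ModR/M byte then says
        # how many displacement bytes follow it.
        return 2 + displacement_size(code[ip + 1])
    raise NotImplementedError(f"opcode {opcode:#04x} is not in this sketch")


if __name__ == "__main__":
    for text in ("09 c8", "89 46 02", "8b 1e 34 12", "90"):
        print(text, "->", instruction_length(bytes.fromhex(text), 0), "bytes")
```

A full decoder would extend the same pattern with prefixes, immediates after the ModR/M byte, and the rest of the opcode map.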
You should write yourself a very simple instruction set simulator with only a handful of instructions, maybe just enough to load a register, add something to it, and then loop. It is extremely educational for what you are trying to understand, and takes maybe half an hour to write, if that.
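A simulator along those lines might look like the Python sketch below. The instruction set is entirely made up for illustration (it is not 8086): opcode 0x00 is HALT, 0x01 is LOAD reg, imm8, 0x02 is ADD reg, imm8 with 8-bit wraparound, and 0x03 is JNZ reg, addr. The first byte alone determines the instruction length, which is what lets the loop find the next instruction.

```python
# Hypothetical toy ISA: opcode byte -> total instruction length in bytes.
LENGTHS = {0x00: 1, 0x01: 3, 0x02: 3, 0x03: 3}


def run(program, max_steps=10_000):
    """Interpret program (a bytes object) and return the two registers."""
    regs = [0, 0]
    ip = 0
    for _ in range(max_steps):
        opcode = program[ip]
        if opcode not in LENGTHS:
            raise ValueError(f"unknown opcode {opcode:#04x} at offset {ip}")
        length = LENGTHS[opcode]
        operands = program[ip + 1:ip + length]
        ip += length                      # default: fall through to next
        if opcode == 0x00:                # HALT
            return regs
        reg, value = operands
        if opcode == 0x01:                # LOAD reg, imm8
            regs[reg] = value
        elif opcode == 0x02:              # ADD reg, imm8 (wraps at 8 bits)
            regs[reg] = (regs[reg] + value) & 0xFF
        elif opcode == 0x03:              # JNZ reg, addr
            if regs[reg] != 0:
                ip = value
    raise RuntimeError("step limit reached")


PROGRAM = bytes([
    0x01, 0x00, 0x00,   # offset 0:  LOAD r0, 0
    0x01, 0x01, 0x03,   # offset 3:  LOAD r1, 3
    0x02, 0x00, 0x05,   # offset 6:  ADD  r0, 5
    0x02, 0x01, 0xFF,   # offset 9:  ADD  r1, 0xFF (subtracts 1 modulo 256)
    0x03, 0x01, 0x06,   # offset 12: JNZ  r1, 6
    0x00,               # offset 15: HALT
])

if __name__ == "__main__":
    print(run(PROGRAM))  # r0 ends at 15 after three passes through the loop
```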
The solution is more complex than a fixed size array.
It's all about context; this is why disassemblers like IDA use complex algorithms to do this.
Instructions are variable length for x86. But if you know the start of an instruction, you know where THAT INSTRUCTION ends. Because of that, you MAY know where the next one begins. I will explain the exceptions soon. But first, here's an example:
ASM:
mov eax, 0
xor eax, eax
Machine:
b8 00 00 00 00
31 c0
Moving to eax is B8, followed by a 32-bit (4-byte) value to move into eax (as eax is 32 bit). In other words, mov eax, immediate will always be 5 bytes. So if you know you are starting on an instruction (not always a safe assumption), and the byte is B8, you know it is a 5 byte instruction, and that the next instruction SHOULD start 5 bytes later.
Note that both instructions (mov eax, 0 and xor eax, eax) effectively do the same thing: clear eax to 0.
Things can get tricky with jumps/calls. It is possible to jump to an address that is in the "middle of an instruction" and still execute valid code.
Let's look at:
mov eax, 0x90909090
machine code:
b8 90 90 90 90
If we later had a jmp instruction that jumped to the address of the 3rd byte of the above instruction (in the middle of it), the CPU would just execute 3 NOPs (no operation) and fall through to the next instruction after it, without setting eax to 0x90909090. This is because a NOP is a 1-byte instruction made up of 0x90.
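The effect can be reproduced with a small Python sketch that walks the bytes linearly from a chosen starting offset. It assumes 32-bit mode with no prefixes and recognizes only the three encodings used in this answer (B8..BF as mov r32, imm32; 90 as nop; 31 with a register-form ModR/M byte as xor). The function name sweep is made up for this example, and the byte string is the mov from above followed by xor eax, eax.

```python
def sweep(code, start):
    """Decode linearly from start; return a list of (offset, mnemonic)."""
    decoded = []
    ip = start
    while ip < len(code):
        opcode = code[ip]
        if 0xB8 <= opcode <= 0xBF:
            name, length = "mov r32, imm32", 5
        elif opcode == 0x90:
            name, length = "nop", 1
        elif opcode == 0x31 and ip + 1 < len(code) and code[ip + 1] >> 6 == 0b11:
            name, length = "xor r32, r32", 2
        else:
            raise NotImplementedError(f"byte {opcode:#04x} at offset {ip}")
        decoded.append((ip, name))
        ip += length
    return decoded


# mov eax, 0x90909090 followed by xor eax, eax
CODE = bytes.fromhex("b8 90 90 90 90 31 c0")

if __name__ == "__main__":
    print("from offset 0:", sweep(CODE, 0))
    print("from offset 2:", sweep(CODE, 2))
```

Starting at offset 0 yields the mov and then the xor; starting at offset 2 (the 3rd byte) yields three nops and then the same xor, so the two decodings line up again at offset 5.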