
Finding number of operands in an instruction from opcodes

I am planning on writing my own small disassembler. I want to decode the opcodes which I get upon reading the executable. I see the following opcodes:

69 62 2f 6c 64 2d 6c

which must correspond to:

imul   $0x6c2d646c,0x2f(%edx),%esp

Now, the "imul" instruction can have either two or three operands. How do I figure this out from the opcodes I have there?

It's based on Intel's i386 instruction set.

Hrishikesh Murali asked Aug 03 '11 10:08


1 Answer

Although the x86 instruction set is quite complex (it's CISC, after all) and I've seen many people here discouraging your attempts to understand it, I'll say the contrary: it can still be understood, and along the way you can learn why it is so complex and how Intel managed to extend it several times, all the way from the 8086 to modern processors.

x86 instructions use variable-length encoding, so they can be made up of multiple bytes. Each byte is there to encode different things, and some of them are optional (it is encoded in the opcode whether those optional fields are used or not).

For example, each opcode can be preceded by zero to four prefix bytes, which are optional. Usually you don't need to worry about them. They are used to change the size of operands, or as escape codes to the "second floor" of the opcode table with extended instructions of modern CPUs (MMX, SSE etc.).
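A disassembler can peel the prefixes off before looking at the opcode. Here is a minimal sketch of that step. The name `split_prefixes` is my own, and the set contains only the legacy prefix bytes (lock/rep, segment overrides, operand-size and address-size overrides). I treat the 0F escape as part of the opcode rather than as a prefix, and I don't enforce the four-byte limit.

```python
# Legacy x86 prefix bytes: lock/repne/rep, six segment overrides,
# operand-size override (66) and address-size override (67).
LEGACY_PREFIXES = {
    0xF0, 0xF2, 0xF3,
    0x2E, 0x36, 0x3E, 0x26, 0x64, 0x65,
    0x66, 0x67,
}

def split_prefixes(code):
    """Return (prefix_bytes, rest) for a byte string holding one instruction."""
    n = 0
    while n < len(code) and code[n] in LEGACY_PREFIXES:
        n += 1
    return code[:n], code[n:]

prefixes, rest = split_prefixes(bytes.fromhex("69622f6c642d6c"))
print(prefixes, rest[0] == 0x69)  # no prefixes here, so the opcode comes first
```

For the bytes in the question the prefix part comes back empty, so decoding starts directly at 69.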

Then there is the actual opcode, which is usually one byte, but can be up to three bytes for extended instructions. If you use only the basic instruction set, you don't need to worry about those either.

Next, there's the so-called ModR/M byte (sometimes also called mode-reg-reg/mem), which encodes the addressing mode and operand types. It's present only for opcodes that actually have such operands. It has three bit fields:

  • First two bits (from the left, most significant ones) encode the addressing mode (4 possible bit combinations).
  • Next three bits encode the first register (8 possible bit combinations).
  • The last three bits can encode another register, or extend the addressing mode, depending on the value of the first two bits.
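The three fields can be pulled out with shifts and masks. A small sketch (the function name `split_modrm` is just my choice for illustration):

```python
def split_modrm(byte):
    """Split a ModR/M byte into its (mod, reg, rm) bit fields."""
    mod = (byte >> 6) & 0b11   # top two bits: addressing mode
    reg = (byte >> 3) & 0b111  # middle three bits: first register
    rm = byte & 0b111          # low three bits: other register or mode extension
    return mod, reg, rm

# 0x62 is the byte that follows the opcode in the question.
print(split_modrm(0x62))  # (1, 4, 2), i.e. 01 100 010
```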

After the ModR/M byte, there can be another optional byte (depending on the addressing mode) called SIB (Scale Index Base). It is used for more exotic addressing modes to encode the scaling factor (1x, 2x, 4x, 8x), the base register, and the index register. It has a layout similar to the ModR/M byte: the first two bits from the left (most significant) encode the scale, and the next three and the last three bits encode the index and base registers, as the name suggests.
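The SIB byte splits the same way as ModR/M. In this sketch (again, `split_sib` is a name I made up) the two scale bits are turned into the actual multiplier, since they encode a power of two:

```python
def split_sib(byte):
    """Split a SIB byte into (scale, index, base)."""
    scale = 1 << ((byte >> 6) & 0b11)  # 00, 01, 10, 11 -> 1, 2, 4, 8
    index = (byte >> 3) & 0b111        # index register code
    base = byte & 0b111                # base register code
    return scale, index, base

# 0x88 is 10 001 000: scale 4, index 001 (ECX), base 000 (EAX),
# which describes the address EAX + ECX*4.
print(split_sib(0x88))  # (4, 1, 0)
```

Note that an index code of 100 means "no index register" rather than ESP, so a real decoder needs a special case for it.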

If there's any displacement used, it goes just after that. It may be 0, 1, 2 or 4 bytes long, depending on the addressing mode and execution mode (16-bit/32-bit/64-bit).

The last one is always the immediate data, if any. It can be also 0, 1, 2 or 4 bytes long.

So now that you know the overall format of x86 instructions, you just need to know what the encodings are for all those bytes. And there are some patterns, contrary to common belief.

For example, all register encodings follow a neat pattern ACDB. That is, for 8-bit instructions, the lowest two bits of the register code encode the A, C, D and B registers, correspondingly:

00 = A register (accumulator)
01 = C register (counter)
10 = D register (data)
11 = B register (base)

I suspect that their 8-bit processors used just these four 8-bit registers encoded this way:

       second
      +---+---+
f     | 0 | 1 |          00 = A
i +---+---+---+          01 = C
r | 0 | A : C |          10 = D
s +---+ - + - +          11 = B
t | 1 | D : B |
  +---+---+---+

Then, on 16-bit processors, they doubled this bank of registers and added one more bit in the register encoding to choose the bank, this way:

       second                second         0 00  =  AL
      +----+----+           +----+----+     0 01  =  CL
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DL
i +---+----+----+     i +---+----+----+     0 11  =  BL
r | 0 | AL : CL |     r | 0 | AH : CH |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  AH
t | 1 | DL : BL |     t | 1 | DH : BH |     1 01  =  CH
  +---+----+----+       +---+----+----+     1 10  =  DH
    0 = BANK L              1 = BANK H      1 11  =  BH

But now you can also choose to use both halves of these registers together, as full 16-bit registers. This is selected by the last bit of the opcode (the least significant bit, the right-most one): if it's 0, this is an 8-bit instruction. But if this bit is set (that is, the opcode is an odd number), this is a 16-bit instruction. In this mode, the two low bits encode one of the ACDB registers, as before. The pattern stays the same, but they now encode full 16-bit registers. And when the third bit (the highest one) of the register code is also set, they switch to a whole other bank of registers, called index/pointer registers: SP (stack pointer), BP (base pointer), SI (source index), DI (destination/data index). So the addressing is now as follows:

       second                second         0 00  =  AX
      +----+----+           +----+----+     0 01  =  CX
f     | 0  | 1  |     f     | 0  | 1  |     0 10  =  DX
i +---+----+----+     i +---+----+----+     0 11  =  BX
r | 0 | AX : CX |     r | 0 | SP : BP |
s +---+ - -+ - -+     s +---+ - -+ - -+     1 00  =  SP
t | 1 | DX : BX |     t | 1 | SI : DI |     1 01  =  BP
  +---+----+----+       +---+----+----+     1 10  =  SI
    0 = BANK OF           1 = BANK OF       1 11  =  DI
  GENERAL-PURPOSE        POINTER/INDEX
     REGISTERS             REGISTERS

When introducing 32-bit CPUs, they doubled these banks again. But the pattern stays the same. Only now the odd opcodes mean the 32-bit registers and the even opcodes, as before, the 8-bit registers. I'd call the odd opcodes the "long" versions, because the 16/32-bit version is used depending on the CPU and its current mode of operation. When it operates in 16-bit mode, the odd ("long") opcodes mean 16-bit registers, but when it operates in 32-bit mode, the odd ("long") opcodes mean 32-bit registers. This can be flipped by prefixing the whole instruction with the 66 prefix (operand size override). The even opcodes (the "short" ones) are always 8-bit. So on a 32-bit CPU, the register codes are:

0 00 = EAX      1 00 = ESP
0 01 = ECX      1 01 = EBP
0 10 = EDX      1 10 = ESI
0 11 = EBX      1 11 = EDI

As you can see, the ACDB pattern stays the same. The SP, BP, SI, DI pattern stays the same too. It just uses the longer versions of the registers.
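Put together, the register tables and the "short"/"long" rule could look like the sketch below. The helper names (`operand_bits`, `register_name`) are mine, and the even/odd test is only valid for opcodes that actually follow this pattern, so take it as an illustration rather than a general rule.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG16 = ["ax", "cx", "dx", "bx", "sp", "bp", "si", "di"]
REG32 = ["e" + name for name in REG16]  # eax, ecx, edx, ebx, esp, ...

def operand_bits(opcode, mode_bits=32, prefix_66=False):
    """Operand width for an opcode that follows the even/odd pattern."""
    if opcode % 2 == 0:
        return 8  # "short" opcodes are always 8-bit
    if prefix_66:
        return 16 if mode_bits == 32 else 32  # the 66 prefix flips the size
    return mode_bits

def register_name(code, bits):
    """Look up a 3-bit register code in the table for the given width."""
    table = {8: REG8, 16: REG16, 32: REG32}[bits]
    return table[code & 0b111]

print(register_name(0b100, operand_bits(0x69)))  # esp
print(register_name(0b100, operand_bits(0x88)))  # ah
```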

There are also some patterns in the opcodes. One of them I've described already (the even vs. odd = 8-bit "short" vs. 16/32-bit "long" stuff). You can see more of them in this opcode map I once made for quick reference and hand-assembling/disassembling: [image: opcode map] (It's not a full table yet, some of the opcodes are missing. Maybe I'll update it someday.)

As you can see, arithmetic & logic instructions are mostly located in the upper half of the table, and the left & right halves of it follow a similar layout. Data moving instructions are in the lower half. All the short conditional jumps are in row 7*. There's also one full row B* reserved for the mov instruction that loads immediate values (constants) into registers. These are all one-byte opcodes immediately followed by the immediate constant, because they encode the destination register in the opcode itself (it's chosen by the column number in this table), in its three least significant bits (the right-most ones). They follow the same pattern for register encoding. And the fourth bit is the "short"/"long" selector. You can see that your imul instruction is already in the table, exactly at the 69 position (huh.. ;J).
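The B* row is simple enough to decode in a few lines. This is only a sketch: `decode_mov_imm` is a made-up name, it assumes 32-bit mode with no prefixes, and it prints AT&T syntax to match the question.

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_imm(code):
    """Decode one B0..BF instruction; return (text, length in bytes)."""
    opcode = code[0]
    if opcode & 0xF0 != 0xB0:
        raise ValueError("not an opcode from row B*")
    wide = bool(opcode & 0b1000)  # fourth bit: "short" (8-bit) or "long" (32-bit)
    reg = opcode & 0b111          # low three bits: destination register
    size = 4 if wide else 1       # immediate size in bytes (32-bit mode assumed)
    imm = int.from_bytes(code[1:1 + size], "little")
    name = (REG32 if wide else REG8)[reg]
    return f"mov ${imm:#x},%{name}", 1 + size

print(decode_mov_imm(bytes.fromhex("b82a000000")))  # ('mov $0x2a,%eax', 5)
print(decode_mov_imm(bytes.fromhex("b1ff")))        # ('mov $0xff,%cl', 2)
```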

For many instructions, the bit just before the "short/long" bit, is to encode the order of operands: which one of the two registers encoded in the ModR/M byte is the source, and which one is the destination (this applies to the instructions with two register operands).
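The register-to-register mov opcodes 88, 89, 8A and 8B show both bits at work. The sketch below handles only that group and only the register-register mode; the function name is my own and the output is AT&T syntax (source first, destination second).

```python
REG8 = ["al", "cl", "dl", "bl", "ah", "ch", "dh", "bh"]
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_reg_reg(code):
    """Decode 88/89/8A/8B with a register-register ModR/M byte (32-bit mode)."""
    opcode, modrm = code[0], code[1]
    if opcode & 0xFC != 0x88:
        raise ValueError("expected opcode 88, 89, 8A or 8B")
    if modrm >> 6 != 0b11:
        raise NotImplementedError("only the register-register mode is handled")
    wide = opcode & 0b01    # bit 0: "short" (8-bit) or "long" (32-bit)
    to_reg = opcode & 0b10  # bit 1: set means the reg field is the destination
    names = REG32 if wide else REG8
    reg = names[(modrm >> 3) & 0b111]
    rm = names[modrm & 0b111]
    src, dst = (rm, reg) if to_reg else (reg, rm)
    return f"mov %{src},%{dst}"

# Same ModR/M byte (11 011 000), different direction bit:
print(decode_mov_reg_reg(bytes.fromhex("89d8")))  # mov %ebx,%eax
print(decode_mov_reg_reg(bytes.fromhex("8bd8")))  # mov %eax,%ebx
```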

As to the ModR/M byte's addressing mode field, here's how to interpret it:

  • 11 is the simplest: it encodes register-to-register transfers. One register is encoded by the three next bits (the reg field), and the other register by the other three bits (the R/M field) of this byte.
  • 01 means that after this byte, a one-byte displacement will be present.
  • 10 means the same, but the displacement used is four-byte (on 32-bit CPUs).
  • 00 is the trickiest: it means indirect addressing or a simple displacement, depending on the contents of the R/M field.

If the SIB byte is present, it is signaled by the 100 bit pattern in the R/M bits. There's also a code 101 for 32-bit displacement-only mode, which doesn't use the SIB byte at all.

Here's a summary of all these addressing modes:

Mod R/M
 11 rrr = register-register  (one encoded in `R/M` bits, the other one in `reg` bits).
 00 rrr = [ register ]       (except codes 100 and 101, the SP and BP slots, which are reused as shown below)
 00 100 = SIB byte present
 00 101 = 32-bit displacement only (no `SIB` byte required)
 01 rrr = [ rrr + disp8 ]    (8-bit displacement after the `ModR/M` byte)
 01 100 = SIB + disp8
 10 rrr = [ rrr + disp32 ]   (except SP, which means that the `SIB` byte is used)
 10 100 = SIB + disp32
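For a disassembler, the practical use of this table is knowing how many bytes follow the ModR/M byte. A sketch for 32-bit addressing follows. `modrm_extra` is a hypothetical helper, and it deliberately ignores one corner case: with Mod = 00, a SIB byte whose base field is 101 also carries a 32-bit displacement, which you can only detect after reading the SIB byte.

```python
def modrm_extra(modrm):
    """Return (has_sib, displacement_size_in_bytes) for 32-bit addressing."""
    mod = (modrm >> 6) & 0b11
    rm = modrm & 0b111
    if mod == 0b11:
        return False, 0        # register-register: nothing follows
    has_sib = rm == 0b100      # 100 in the R/M field selects a SIB byte
    if mod == 0b00:
        disp = 4 if rm == 0b101 else 0  # 00 101: 32-bit displacement only
    elif mod == 0b01:
        disp = 1               # disp8
    else:
        disp = 4               # disp32
    return has_sib, disp

print(modrm_extra(0x62))  # (False, 1): no SIB, one displacement byte
```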

So let's now decode your imul:

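69 is its opcode. It encodes the three-operand version of imul that takes a full-size immediate (32-bit here). The 6B version takes an 8-bit immediate and sign-extends it. (They differ in bit 1 of the opcode, if anyone asks.) This also answers your question about the number of operands: the opcode alone selects the form. 69 and 6B are always the three-operand forms, the two-operand imul is a separate two-byte opcode (0F AF), and the one-operand form is F6/F7 with 101 in the reg field of the ModR/M byte.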

62 is the ModR/M byte. In binary it is 0110 0010 or 01 100 010. The first two bits (the Mod field) mean the indirect addressing mode with an 8-bit displacement. The next three bits (the reg field) are 100 and encode the SP register (in this case ESP, since we're in 32-bit mode) as the destination register. The last three bits are the R/M field and we have 010 there, which encodes the D register (in this case EDX). With this Mod value it is the base register of the memory (source) operand, which is why it shows up as 0x2f(%edx).

Now we expect an 8-bit displacement. And there it is: 2f is the displacement, a positive one (+47 in decimal).

The last part is four bytes of the immediate constant, which is required by the imul instruction. In your case this is 6c 64 2d 6c which in little-endian is $6c2d646c.
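The whole walk-through fits in one small function. It is a sketch that covers only what this example needs (opcode 69, Mod = 01, no SIB byte, 32-bit mode); `decode_imul_69` is a name I made up, and anything else raises an error instead of guessing.

```python
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_imul_69(code):
    """Decode `69 /r disp8 imm32` into AT&T syntax (32-bit mode assumed)."""
    if code[0] != 0x69:
        raise ValueError("expected opcode 69")
    modrm = code[1]
    mod, reg, rm = modrm >> 6, (modrm >> 3) & 0b111, modrm & 0b111
    if mod != 0b01 or rm == 0b100:
        raise NotImplementedError("only [reg + disp8] without SIB is handled")
    disp = int.from_bytes(code[2:3], "little", signed=True)  # disp8 is signed
    imm = int.from_bytes(code[3:7], "little")                # imm32, little-endian
    return f"imul ${imm:#x},{disp:#x}(%{REG32[rm]}),%{REG32[reg]}"

print(decode_imul_69(bytes.fromhex("69622f6c642d6c")))
# imul $0x6c2d646c,0x2f(%edx),%esp
```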

And that's the way the cookie crumbles ;-J

SasQ answered Sep 24 '22 03:09