What is the actual relation between assembly, machine code, bytecode, and opcode?

I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.

The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.

The main things I am looking for are:

  1. a snippet of assembly code
  2. a snippet of machine code
  3. a mapping between the snippet of assembly and machine code (how to do that mapping, or at least some general examples, and how do you know how to do this, where is all this information on the web)
  4. how to interpret the machine code (like are opcodes somehow related, and where is all the information on the web about what all those numbers mean)

Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.

Relation Between Assembly and Machine Code

My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.

So when you assemble some code such as this example.asm:

global main
section .text

main:
  call write

write:
  mov rax, 0x2000004
  mov rdi, 1
  mov rsi, message
  mov rdx, length
  syscall

section .data
message: db 'Hello, world!', 0xa
length: equ $ - message

(assemble it with nasm -f macho64 -o example.o example.asm), it outputs this example.o object file:

cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800 

(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:

cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this

But how did it go from assembly instructions to those numbers? Is there some sort of standard reference for whatever architecture you are on (I am using x86-64 through NASM on OSX) that lists all of those numbers, what they mean, and how each set of numbers maps to each assembly instruction?

I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.

Relation Between Assembly and Bytecode (or is it called "opcode"?)

So from my reading so far, assembly gets transformed into machine code as demonstrated above.

But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:

void myfunc(int a) {
  printf("%s", a);
}

The assembly for this function would look like this:

OP Params OpName     Description
13 82 6a  PushString 82 means string, 6a is the address of "%s"
                     So this function pushes a pointer to "%s" on the stack.
13 83 00  PushInt    83 means integer, 00 means the one on the top of the stack.
                     So this function gets the integer at the top of the stack,
                     And pushes it on the stack again
17 13 88  Call       1388 is printf, so this calls the printf function
03 02     Pop        This pops the two things we pushed back off the stack
02        Return     This returns to the calling code.

So then I get confused. Doing some digging, I can't tell if each of those 2-digit hex numbers like 13 82 6a is individually called an "opcode", and the whole set of them is called "bytecode" as a catch-all term. In addition, I can't find a table that lists all of these 2-digit hex numbers and their relation to machine code or assembly.

To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works.) I think this would clarify it not only for me but for many people down the road who are interested in learning more about the bare metal.

One other reason this would be valuable to know is that it would help one understand how the LLVM compiler generates machine code. Do they have some sort of "complete list" of 2-digit opcodes or machine code 4-digit sequences, and know exactly how that maps to any architecture-specific assembly? Where did they get that information from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.

Update

Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?

Lance asked Dec 23 '14 23:12


2 Answers

Yes, each architecture has an instruction set reference that specifies how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z.

Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:

 1                                  global main
 2                                  section .text
 3
 4                                  main:
 5 00000000 E800000000                call write
 6
 7                                  write:
 8 00000005 B804000002                mov rax, 0x2000004
 9 0000000A BF01000000                mov rdi, 1
10 0000000F 48BE-                     mov rsi, message
11 00000011 [0000000000000000]
12 00000019 BA0E000000                mov rdx, length
13 0000001E 0F05                      syscall
14
15                                  section .data
16 00000000 48656C6C6F2C20776F-     message: db 'Hello, world!', 0xa
17 00000009 726C64210A
18                                  length: equ $ - message

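You can see the generated machine code in the third column (the first is the line number, the second is the offset within the section). The trailing dash in 48BE- means the instruction's bytes continue on the next listing line, and the bracketed zeros are a placeholder for the address of message, which is only known once the file is linked.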
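The offsets in the second column also explain why call write assembles to E8 followed by four zero bytes. As a rough sketch, assuming the near-call form E8 rel32, whose operand is the distance from the end of the call instruction to its target (the helper name encode_call_rel32 is made up for illustration):

```python
import struct

def encode_call_rel32(call_offset: int, target_offset: int) -> bytes:
    """Encode a near CALL (opcode E8) with a 32-bit relative displacement.

    Hypothetical helper: the displacement is measured from the end of the
    5-byte call instruction to the target and stored little-endian.
    """
    instruction_length = 5  # 1 opcode byte + 4 displacement bytes
    rel32 = target_offset - (call_offset + instruction_length)
    return b"\xE8" + struct.pack("<i", rel32)

# In the listing, the call sits at offset 0 and write starts at offset 5.
print(encode_call_rel32(0x00, 0x05).hex().upper())  # E800000000
```

Because write begins exactly where the call instruction ends, the displacement is zero, which matches line 5 of the listing.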

Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.
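To check that the listing bytes really are inside the object file, the rows of the example.o dump around the code can be searched for them. This is only a sketch: the rows are copied by hand from the dump in the question, and ROWS_START assumes 16 bytes per row, with the first copied row being the 18th row of the dump.

```python
# Rows copied from the example.o dump in the question.
dump_rows = """
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
"""
ROWS_START = 0x110  # assumed file offset of the first row above (17 rows * 16 bytes)

# Machine code bytes from the third column of the listing, in order.
listing_code = bytes.fromhex(
    "E800000000" "B804000002" "BF01000000"
    "48BE" "0000000000000000" "BA0E000000" "0F05"
)

# Strip all whitespace, then turn the hex digits back into raw bytes.
blob = bytes.fromhex("".join(dump_rows.split()))
code_at = ROWS_START + blob.find(listing_code)
text_at = ROWS_START + blob.find(b"Hello, world!\n")

print(hex(code_at), len(listing_code))  # 0x120 32
print(hex(text_at))                     # 0x140
```

The 32 instruction bytes sit at file offset 0x120 and the message text at 0x140; everything before them in the dump is object file header data rather than machine code.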

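An opcode is generally considered to be the part of a machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode and 04000002 is the immediate operand, stored least significant byte first (little-endian), so it reads back as 0x02000004. Note that nasm picked the shorter mov eax, imm32 encoding here: the value fits in 32 bits, and writing eax zeroes the upper half of rax. In this form the target register is folded into the opcode byte itself (B8 for eax, BF for edi, BA for edx).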
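A minimal sketch of that encoding for the three short mov lines in the listing, assuming the 32-bit register form that nasm emitted (opcode B8 plus a register number, then a 4-byte little-endian immediate). The function and table names are illustrative only:

```python
import struct

# Register numbers added to the base opcode B8 in `mov r32, imm32`.
REG32 = {"eax": 0, "ecx": 1, "edx": 2, "ebx": 3,
         "esp": 4, "ebp": 5, "esi": 6, "edi": 7}

def encode_mov_r32_imm32(reg: str, value: int) -> bytes:
    """Encode `mov r32, imm32`: opcode byte B8+reg, then the value little-endian."""
    opcode = 0xB8 + REG32[reg]
    return bytes([opcode]) + struct.pack("<I", value)

for reg, value in [("eax", 0x2000004), ("edi", 1), ("edx", 14)]:
    print(reg, encode_mov_r32_imm32(reg, value).hex().upper())
# eax B804000002
# edi BF01000000
# edx BA0E000000
```

The three results match lines 8, 9 and 12 of the listing (length is 14, the size of the message).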

Bytecode is not typically used in the assembly context; it can be thought of as the machine code for a virtual machine.


For a walkthrough, x86 is a very complicated architecture, but your sample code happens to have a simple instruction: syscall. So let's see how to turn that into machine code. Open the reference PDF mentioned above and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done: those 2 bytes are the machine code.

How do we turn it back? Go to Appendix A: Opcode Map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." So we skip the 0F, split the 05 into 0 and 5, and look that up in Table A-3 at row 0, column 5. We find it is a syscall instruction.
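The reverse lookup can be sketched in a few lines. The dictionary below is a tiny hand-copied excerpt of the two-byte opcode map, not the real table, and the names TWO_BYTE_MAP_EXCERPT and lookup_two_byte_opcode are made up for illustration:

```python
# Keyed by (row, column) = (upper nibble, lower nibble) of the byte after 0F.
# Only a few cells are included; the real table has 256 of them.
TWO_BYTE_MAP_EXCERPT = {
    (0x0, 0x5): "SYSCALL",
    (0x3, 0x1): "RDTSC",
    (0xA, 0x2): "CPUID",
}

def lookup_two_byte_opcode(code: bytes) -> str:
    """Look up a prefix-free 0F xx instruction in the excerpt above."""
    if len(code) != 2 or code[0] != 0x0F:
        raise ValueError("expected exactly two bytes starting with 0F")
    # Split the second byte into its upper and lower 4-bit halves.
    row, column = code[1] >> 4, code[1] & 0x0F
    return TWO_BYTE_MAP_EXCERPT.get((row, column), "not in this excerpt")

print(lookup_two_byte_opcode(bytes.fromhex("0F05")))  # SYSCALL
```

A real disassembler also has to handle prefixes, operands and the one-byte and three-byte maps, which this sketch deliberately ignores.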

Jester answered Oct 03 '22 14:10


Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on, and how each set of numbers maps to each assembly instruction?

Yes, though they can be very complex. Because assemblers and compilers are so prevalent, these references are also somewhat hard to find: hardly anybody needs to consult them directly.

Relation Between Assembly and Bytecode

  • Machine code - One or more numeric values read by a CPU. Each instruction starts with a number, its "opcode", which may be followed by one or more parameters to act on. In the linked code, 13 tells the processor to push a string onto the stack.
  • OpCode - The numeric value of a command. In the sample, the opcode for pushing a string is 13.
  • Assembly - Human-readable instructions for a CPU's internal machine code. There is almost always one assembly instruction per machine code instruction. In my code that you linked to, the "assembly" instruction PushString maps to machine instruction 13.
  • Byte Code - Since each processor uses a different machine code, programs are sometimes compiled to machine code for an imaginary "virtual machine", and a separate program then reads this fake machine code and executes it (either via emulation or JIT). Java, C#, and VB all do this. This "fake" machine code is called "byte code", though the two terms are often used interchangeably.

I should note that the bytecode instructions used in this post and in my other post that you linked to are simplified extracts from a proprietary byte code I work with at my company. We have a proprietary programming language that compiles to this bytecode which is interpreted by our product, and some of the values I mentioned are real bytecodes we actually use. 13 is actually pushAnything with complex parameters, but I kept things simple for the answer.
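To make the "machine code for a virtual machine" idea concrete, here is a toy interpreter. Everything in it is hypothetical: the opcode values, the instruction set, and the run function are invented for this sketch and are not the values from the bytecode described above.

```python
# Hypothetical opcodes for a toy stack machine (not the values from the post).
PUSH, ADD, PRINT, HALT = 0x01, 0x02, 0x03, 0x00

def run(bytecode: bytes) -> list[int]:
    """Interpret the toy bytecode and return everything PRINT emitted."""
    stack, printed, pc = [], [], 0
    while pc < len(bytecode):
        opcode = bytecode[pc]
        pc += 1
        if opcode == PUSH:            # one parameter byte follows the opcode
            stack.append(bytecode[pc])
            pc += 1
        elif opcode == ADD:           # pop two values, push their sum
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == PRINT:         # pop one value and record it
            printed.append(stack.pop())
        elif opcode == HALT:
            break
        else:
            raise ValueError(f"unknown opcode {opcode:#04x} at offset {pc - 1}")
    return printed

program = bytes([PUSH, 2, PUSH, 40, ADD, PRINT, HALT])
print(program.hex(" "))   # 01 02 01 28 02 03 00
print(run(program))       # [42]
```

The loop plays the role of the CPU: it reads an opcode, reads any parameters that opcode expects, and performs the operation. A real CPU does the same with its machine code in hardware.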

Mooing Duck answered Oct 03 '22 14:10