What is the actual relation between assembly, machine code, bytecode, and opcode?
I have read most of the SO questions about assembly and machine code, such as this, but they are too high level and do not show examples of actual assembly code being transformed into machine code. As a result, I still don't understand how it works at a deeper level.
The ideal answer to this question would show a specific example of some assembly code, such as the snippet below, and how each assembly instruction gets mapped to machine code, bytecode, and/or opcode. An answer like this would be very helpful to future people learning assembly, because so far in the past few days of digging I haven't found any clear summary.
The main things I am looking for are:
Note: I don't have a computer science background, so I have just been slowly going lower level over the past several years and have now gotten to the point of wanting to understand assembly and machine code.
Relation Between Assembly and Machine Code
My current understanding is that an "assembler" (such as NASM) takes assembly code and creates machine code from it.
So when you compile some assembly such as this example.asm:
global main
section .text
main:
call write
write:
mov rax, 0x2000004
mov rdi, 1
mov rsi, message
mov rdx, length
syscall
section .data
message: db 'Hello, world!', 0xa
length: equ $ - message
(compile it with nasm -f macho64 -o example.o example.asm). It outputs this example.o object file:
cffa edfe 0700 0001 0300 0000 0100 0000
0200 0000 0001 0000 0000 0000 0000 0000
1900 0000 e800 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
2e00 0000 0000 0000 2001 0000 0000 0000
2e00 0000 0000 0000 0700 0000 0700 0000
0200 0000 0000 0000 5f5f 7465 7874 0000
0000 0000 0000 0000 5f5f 5445 5854 0000
0000 0000 0000 0000 0000 0000 0000 0000
2000 0000 0000 0000 2001 0000 0000 0000
5001 0000 0100 0000 0005 0080 0000 0000
0000 0000 0000 0000 5f5f 6461 7461 0000
0000 0000 0000 0000 5f5f 4441 5441 0000
0000 0000 0000 0000 2000 0000 0000 0000
0e00 0000 0000 0000 4001 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0200 0000 1800 0000
5801 0000 0400 0000 9801 0000 1c00 0000
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
1100 0000 0100 000e 0700 0000 0e01 0000
0500 0000 0000 0000 0d00 0000 0e02 0000
2000 0000 0000 0000 1500 0000 0200 0000
0e00 0000 0000 0000 0100 0000 0f01 0000
0000 0000 0000 0000 0073 7461 7274 0077
7269 7465 006d 6573 7361 6765 006c 656e
6774 6800
(that is the entire contents of example.o). When you then "link" that using ld -o example example.o, it gives you more machine code:
cffa edfe 0700 0001 0300 0080 0200 0000
0d00 0000 7803 0000 8500 0000 0000 0000
1900 0000 4800 0000 5f5f 5041 4745 5a45
524f 0000 0000 0000 0000 0000 0000 0000
0010 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 0000 0000 0000 0000
0000 0000 0000 0000 1900 0000 9800 0000
5f5f 5445 5854 0000 0000 0000 0000 0000
0010 0000 0000 0000 0010 0000 0000 0000
... 523 lines of this
But how did it go from assembly instructions, to those numbers? Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on (I am using x86-64 through NASM on OSX), and how each set of numbers maps to each assembly instruction?
I understand that machine code is different for every machine, and there are dozens if not hundreds of different types of machines. So I am not currently looking for how assembly gets transformed to every one (that would be complicated). I just am interested in an example that illustrates how the transformation works, and any architecture can serve as the example. And from that point, I could go and research the specific architecture I am interested in and find the mapping.
Relation Between Assembly and Bytecode (or is it called "opcode"?)
So from my reading so far, assembly gets transformed into machine code as demonstrated above.
But now I get confused. I see people talk about bytecode, such as in this SO answer, showing stuff like this:
void myfunc(int a) { printf("%s", a); }
The assembly for this function would look like this:
OP  Params  OpName      Description
13  82 6a   PushString  82 means string, 6a is the address of "%s". So this function pushes a pointer to "%s" on the stack.
13  83 00   PushInt     83 means integer, 00 means the one on the top of the stack. So this function gets the integer at the top of the stack and pushes it on the stack again.
17  13 88   Call        1388 is printf, so this calls the printf function.
03  02      Pop         This pops the two things we pushed back off the stack.
02          Return      This returns to the calling code.
So then I get confused. Doing some digging, I can't tell if each of those 2-digit hex numbers like 13 82 6a are each, individually, called "opcodes", and the whole set of them is called "bytecode" as a catch-all term. In addition, I can't find a table that lists out all of these 2-digit hex numbers, and what their relation is to machine code, or assembly.
To summarize, I am very much looking forward to an example showing how assembly instructions map to machine code, and its relation to bytecode and/or opcode. (I am not looking for how a compiler does this, just how the general mapping works). I think this would clarify it for not only myself but for many people down the road who are interested in learning more about the bare metal.
One other reason why this would be valuable to know is, so one can understand how the LLVM compiler generates machine code. Do they have some sort of "complete list" of 2-digit opcodes or machine code 4-digit sequences, and know exactly how that maps to any architecture-specific assembly? Where did they get that information from? An answer to this overall question would make it much clearer how LLVM implemented its code generation.
Update
Updating from @HansPassant's comment. I actually don't care what the actual distinctions are between the words, sorry if that wasn't clear. I just want to know this: how does assembly map to machine code (and where are places to begin looking for the references that hold that information on the web), and are opcodes or bytecode used anywhere in that process? And if so how?
Assembly language is a low-level programming language. It equates to machine code but is more readable. It can be directly translated into machine code, but it uses mnemonics to represent the instructions to make it easier to understand.
Bytecode is mainly for platform independence and needs a virtual environment to run. Assembly code is a human-readable form of machine code (one level above it), and the machine code it translates to is run directly by the CPU. Bytecode is not machine/hardware specific, whereas assembly code is, because it deals with the hardware directly.
In computing, an opcode (abbreviated from operation code, also known as instruction machine code, instruction code, instruction syllable, instruction parcel or opstring) is the portion of a machine language instruction that specifies the operation to be performed.
Yes, each architecture has an instruction set reference that gives how instructions are encoded. For x86, it's the Intel® 64 and IA-32 Architectures Software Developer's Manual Volume 2 (2A, 2B & 2C): Instruction Set Reference, A-Z
Most assemblers, including nasm, can produce a listing file for you. Feeding your sample code to nasm -l, we get:
1 global main
2 section .text
3
4 main:
5 00000000 E800000000 call write
6
7 write:
8 00000005 B804000002 mov rax, 0x2000004
9 0000000A BF01000000 mov rdi, 1
10 0000000F 48BE- mov rsi, message
11 00000011 [0000000000000000]
12 00000019 BA0E000000 mov rdx, length
13 0000001E 0F05 syscall
14
15 section .data
16 00000000 48656C6C6F2C20776F- message: db 'Hello, world!', 0xa
17 00000009 726C64210A
18 length: equ $ - message
You can see the generated machine code in the third column (first is line number, second is address).
Note that the output of the assembler is an object file, and the output of the linker is an executable. Both of those have a complex structure and contain more than just the machine code. This is why your hexdump differs from the above listing.
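To make that concrete, here is a small Python sketch (not part of any toolchain, just an illustration) that glues together the bytes from the third column of the listing and looks for them in three rows copied from the hexdump of example.o in the question. The variable names are my own, and the 8 zero bytes after 48BE are the placeholder the listing shows in brackets.

```python
# Bytes from the third column of the nasm listing, one entry per instruction.
listing = [
    "E800000000",            # call write
    "B804000002",            # mov rax, 0x2000004
    "BF01000000",            # mov rdi, 1
    "48BE0000000000000000",  # mov rsi, message (address is a placeholder for now)
    "BA0E000000",            # mov rdx, length
    "0F05",                  # syscall
]
text_section = bytes.fromhex("".join(listing))

# Three rows copied from the hexdump of example.o in the question.
dump_rows = """
e800 0000 00b8 0400 0002 bf01 0000 0048
be00 0000 0000 0000 00ba 0e00 0000 0f05
4865 6c6c 6f2c 2077 6f72 6c64 210a 0000
"""
object_bytes = bytes.fromhex("".join(dump_rows.split()))

# Where the code sits inside these rows, and the 14 data bytes right after it.
code_offset = object_bytes.find(text_section)
message = object_bytes[len(text_section):len(text_section) + 14]

print(code_offset)                      # 0: the rows begin with the code
print(len(text_section))                # 32 bytes of machine code
print(message.decode("ascii"), end="")  # Hello, world!
```

In the full dump, 18 rows of 16 bytes come before these rows, so the code starts at file offset 0x120. Everything before and after it is object-file bookkeeping (headers, symbol names and so on), not instructions.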
Opcode is generally considered to be the part of the machine code instruction that specifies the operation to perform. For example, in the above code you have B804000002 mov rax, 0x2000004. There B8 is the opcode, 04000002 is the immediate operand (stored little-endian, so the bytes 04 00 00 02 read back as 0x02000004).
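As a rough sketch of how that split works, the Python below decodes the three 5-byte mov instructions from the listing. It assumes the "B8+rd" encoding of MOV r32, imm32 from the Intel manual, where the low 3 bits of the opcode byte select the register. The function name is made up for this example, and it handles only this one instruction form.

```python
# 32-bit register names in encoding order (index = low 3 bits of the opcode).
REG32 = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

def decode_mov_imm32(code: bytes) -> str:
    """Decode the 5-byte 'B8+rd' form: MOV r32, imm32 (illustrative only)."""
    opcode = code[0]
    if len(code) != 5 or not 0xB8 <= opcode <= 0xBF:
        raise ValueError("not a B8+rd imm32 instruction")
    register = REG32[opcode - 0xB8]                  # B8 -> eax, BA -> edx, BF -> edi
    immediate = int.from_bytes(code[1:5], "little")  # immediates are little-endian
    return f"mov {register}, {immediate:#x}"

for hex_bytes in ("B804000002", "BF01000000", "BA0E000000"):
    print(hex_bytes, "->", decode_mov_imm32(bytes.fromhex(hex_bytes)))
```

Note that the decoded register is eax rather than rax: the listing shows the shorter 32-bit encoding, which is equivalent here because writing a 32-bit register zeroes the upper half of the 64-bit one.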
Bytecode is not typically used in the assembly context; it could be thought of as the machine code for a virtual machine.
For a walkthrough, x86 is a very complicated architecture. But your sample code happens to have a simple instruction, the syscall. So let's see how to turn that into machine code. Open the above mentioned reference pdf, and go to the section about syscall in chapter 4. You will immediately see it listed as opcode 0F 05. Since it doesn't take any operands, we are done, those 2 bytes are the machine code. How do we turn it back? Go to Appendix A: Opcode map. Section A.1 tells us: "For 2-byte opcodes beginning with 0FH (Table A-3), skip any instruction prefixes, the 0FH byte (0FH may be preceded by 66H, F2H, or F3H) and use the upper and lower 4-bit values of the next opcode byte to index table rows and columns." Okay so we skip the 0F and split the 05 into 0 and 5 and look that up in table A-3 in row #0, column #5. We find it is a syscall instruction.
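The same row/column lookup can be sketched in Python. The dictionary below is a tiny hand-copied excerpt with four entries I am sure of, not the real Table A-3 (which has 256 cells), and the names TWO_BYTE_MAP and lookup_two_byte are invented for this example.

```python
# (row, column) -> mnemonic, a small excerpt of the two-byte opcode map.
TWO_BYTE_MAP = {
    (0x0, 0x5): "syscall",  # 0F 05
    (0x0, 0xB): "ud2",      # 0F 0B
    (0x3, 0x1): "rdtsc",    # 0F 31
    (0xA, 0x2): "cpuid",    # 0F A2
}

def lookup_two_byte(code: bytes) -> str:
    """Look up a 0F-prefixed opcode by the nibbles of its second byte."""
    if len(code) < 2 or code[0] != 0x0F:
        raise ValueError("expected a two-byte opcode starting with 0F")
    row = code[1] >> 4       # upper 4 bits pick the table row
    column = code[1] & 0x0F  # lower 4 bits pick the table column
    return TWO_BYTE_MAP.get((row, column), "(not in this excerpt)")

print(lookup_two_byte(bytes.fromhex("0F05")))  # syscall
```

A real disassembler does essentially this, plus handling for prefixes and operands, which is where most of the x86 complexity lives.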
Is there some sort of standard reference that lists out all of those numbers, and what they mean, for whatever architecture you are on, and how each set of numbers maps to each assembly instruction?
Yes, though they can be very complex. Also, due to the prevalence of assemblers and compilers, they're also sort of hard to find, because pretty much nobody uses them.
Relation Between Assembly and Bytecode
13 tells the processor to push a string onto the stack. PushString maps to machine instruction 13. I should note that the bytecode instructions used in this post and in my other post that you linked to are simplified extracts from a proprietary byte code I work with at my company. We have a proprietary programming language that compiles to this bytecode which is interpreted by our product, and some of the values I mentioned are real bytecodes we actually use. 13 is actually pushAnything with complex parameters, but I kept things simple for the answer.
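For illustration, here is a minimal Python sketch of a disassembler for the simplified bytecode quoted in the question (13 82 6a, 13 83 00, 17 13 88, 03 02, 02). The parameter counts are my reading of that table, and the names OPCODES, PUSH_TYPES and disassemble are invented. It is not how the real product works.

```python
# opcode -> (name, number of parameter bytes), read off the question's table.
OPCODES = {
    0x13: ("Push", 2),    # type tag, then the value
    0x17: ("Call", 2),    # two bytes identifying the function
    0x03: ("Pop", 1),     # how many items to pop
    0x02: ("Return", 0),
}
# For opcode 13, the first parameter says what kind of thing is pushed.
PUSH_TYPES = {0x82: "PushString", 0x83: "PushInt"}

def disassemble(bytecode: bytes) -> list[str]:
    """Turn raw bytecode back into one readable line per instruction."""
    lines, pos = [], 0
    while pos < len(bytecode):
        opcode = bytecode[pos]
        name, n_params = OPCODES[opcode]
        params = bytecode[pos + 1 : pos + 1 + n_params]
        if opcode == 0x13:
            name = PUSH_TYPES.get(params[0], name)
        lines.append(f"{name} {params.hex(' ')}".rstrip())
        pos += 1 + n_params  # skip the opcode and its parameters
    return lines

program = bytes.fromhex("13 82 6a 13 83 00 17 13 88 03 02 02")
for line in disassemble(program):
    print(line)
```

Notice that the byte 02 appears twice at the end with different meanings: once as the parameter of Pop and once as the Return opcode. Only the position in the stream tells you whether a byte is an opcode or an operand, and the same is true of real machine code.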