I am looking for a brief description of the use of an assembler in producing machine code.
I know that assembly is a 1:1 translation of machine code, but I am confused about object code and linkers and how they fit into the picture.
I don't need a complex answer; a simple one will do fine.
An assembler is a program that takes basic computer instructions and converts them into a pattern of bits that the computer's processor can use to perform its basic operations. Some people call these instructions assembler language and others use the term assembly language.
The assembler processes conditional assembly instructions and macro processing instructions during conditional assembly. During this processing, the assembler evaluates arithmetic, logical, and character conditional assembly expressions. Conditional assembly takes place before assembly time.
The purpose of an assembler is to translate assembly language into object code. Whereas a compiler typically generates many machine code instructions for each high-level statement, an assembler generally creates one machine code instruction for each assembly instruction (macros and pseudo-instructions aside).
Both ultimately produce binary code, that is, patterns of 0s and 1s. Compilers exist for languages such as Java, C, and C++. Examples of assemblers are GAS (the GNU Assembler) and NASM.
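To make the one-instruction-in, one-encoding-out idea concrete, here is a minimal sketch of a toy assembler in Python. It is not how a real assembler is structured: it only understands three instruction forms chosen for illustration ('ret', 'int imm8' and 'mov rax, imm32'), and the function names assemble_line and assemble are made up for this sketch. The encodings themselves are standard x86-64.

```python
import struct

def assemble_line(line: str) -> bytes:
    """Encode one instruction from a tiny, hand-picked x86-64 subset."""
    parts = line.replace(",", " ").split()
    mnemonic, operands = parts[0], parts[1:]
    if mnemonic == "ret" and not operands:
        return b"\xc3"                                   # near return
    if mnemonic == "int" and len(operands) == 1:
        return b"\xcd" + struct.pack("<B", int(operands[0], 0))
    if mnemonic == "mov" and len(operands) == 2 and operands[0] == "rax":
        # REX.W C7 /0 id: move a sign-extended 32-bit immediate into rax
        return b"\x48\xc7\xc0" + struct.pack("<i", int(operands[1], 0))
    raise ValueError(f"unsupported instruction: {line!r}")

def assemble(source: str) -> bytes:
    """Translate each non-empty line independently and concatenate the bytes."""
    return b"".join(assemble_line(l) for l in source.splitlines() if l.strip())

program = """
mov rax, 1
int 0x80
"""
print(assemble(program).hex(" "))   # 48 c7 c0 01 00 00 00 cd 80
```

Each source line maps to exactly one fixed byte sequence, which is the sense in which assembly is a 1:1 translation. What this sketch cannot handle is a reference to a name whose address is not yet known, and that is where object files and the linker come in.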
Both an assembler and a compiler translate source files into object files.
Object files are effectively an intermediate step before the final executable output (generated by the linker).
The linker takes the specified object files and libraries (which are packages of object files) and resolves relocation (or 'fixup') records.
These relocation records are made when the compiler/assembler doesn't know the address of a function or variable used in the source code, and generates a reference for it by name, which can be resolved by the linker.
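For example, say you want a program that prints a message to the screen, separated into two source files that you assemble separately and then link. (The example is 64-bit Linux code in NASM syntax; note that it makes its system calls through the legacy 'int 0x80' interface rather than the native x86-64 'syscall' instruction.)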
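main.asm:
bits 64
section .text
extern do_message          ; defined in message.asm, resolved by the linker
global _start              ; entry point, made visible to the linker
_start:
    call do_message        ; address unknown at assembly time
    mov rax, 1             ; exit (number 1 in the int 0x80 table)
    int 0x80               ; the exit status is taken from ebx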
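message.asm:
bits 64
section .text
global do_message          ; make the symbol visible to other object files
do_message:
    mov rdi, message       ; rdi = start of the string
    mov rcx, dword -1      ; rcx = -1, so the scan is effectively unbounded
    xor rax, rax           ; al = 0, the byte to search for
    repnz scasb            ; advance rdi until just past the terminating 0
    sub rdi, message       ; rdi = byte count, including the terminating 0
    mov rax, 4             ; write (number 4 in the int 0x80 table)
    mov rbx, 1             ; file descriptor 1 (stdout)
    mov rcx, message       ; buffer address
    mov rdx, rdi           ; number of bytes to write
    int 0x80
    ret
section .data
message: db "hello world",10,0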
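If you assemble these and look at the object file produced from main.asm (e.g. with objdump -d main.o), you will notice that the operand of 'call do_message' is 00 00 00 00. This is only a placeholder: the assembler does not yet know where do_message will end up, so as it stands the call would simply target the next instruction (which is why objdump prints 'callq 5').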
0000000000000000 <_start>:
0: e8 00 00 00 00 callq 5 <_start+0x5>
5: 48 c7 c0 01 00 00 00 mov $0x1,%rax
c: cd 80 int $0x80
However, a relocation record is created for those 4 bytes:
$ objdump -r main.o
main.o: file format elf64-x86-64
RELOCATION RECORDS FOR [.text]:
OFFSET TYPE VALUE
0000000000000001 R_X86_64_PC32 do_message+0xfffffffffffffffc
000000000000000d R_X86_64_32 .data
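The offset is 1 (the byte right after the e8 opcode) and the type is R_X86_64_PC32, which tells the linker to compute the 32-bit distance from that location to do_message and store it at that offset. The addend 0xfffffffffffffffc is -4: the CPU measures the displacement from the end of the 4-byte field (the start of the next instruction), not from its beginning.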
When you link the final program with 'ld -o program main.o message.o', the relocations are all resolved, and if nothing is unresolved, you are left with an executable.
When we run 'objdump -d' on the executable, we can see that the placeholder has been replaced and the call now points at do_message:
00000000004000f0 <_start>:
4000f0: e8 0b 00 00 00 callq 400100 <do_message>
4000f5: 48 c7 c0 01 00 00 00 mov $0x1,%rax
4000fc: cd 80 int $0x80
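The numbers in the two disassembly listings above can be checked by hand. The following Python sketch redoes the linker's arithmetic for this one relocation, using the usual R_X86_64_PC32 rule (symbol address + addend - address of the patched field). The addresses are copied from the objdump output above; the helper names to_signed64, pc32_value and call_target are invented for this sketch.

```python
import struct

def to_signed64(value: int) -> int:
    """Read a 64-bit pattern (as objdump prints addends) as a signed number."""
    return value - (1 << 64) if value & (1 << 63) else value

def pc32_value(symbol_addr: int, addend: int, place: int) -> int:
    """R_X86_64_PC32: S + A - P, stored in a signed 32-bit field."""
    return symbol_addr + addend - place

def call_target(insn_addr: int, insn: bytes) -> int:
    """Target of a 5-byte 'call rel32' (opcode e8): next instruction + rel32."""
    assert len(insn) == 5 and insn[0] == 0xE8
    (rel32,) = struct.unpack("<i", insn[1:])
    return insn_addr + 5 + rel32

addend = to_signed64(0xFFFFFFFFFFFFFFFC)   # -4, from 'objdump -r'
start = 0x4000F0                           # _start in the linked program
place = start + 1                          # relocation offset 1 in .text
do_message = 0x400100                      # where the linker put do_message

rel32 = pc32_value(do_message, addend, place)
patched = b"\xe8" + struct.pack("<i", rel32)

print(addend)                                                # -4
print(patched.hex(" "))                                      # e8 0b 00 00 00
print(hex(call_target(0, bytes.fromhex("e800000000"))))      # 0x5
print(hex(call_target(start, patched)))                      # 0x400100
```

The third line of output reproduces the 'callq 5' seen in main.o before linking, and the last one reproduces the 'callq 400100' seen in the final executable.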
The same kind of relocations are used for variables as well as functions. The same process happens when you link your program against large libraries such as libc: you define a function called 'main', to which the C runtime startup code has an external reference. When you run the executable, that startup code typically runs first and then calls your 'main' function.
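The whole assemble-then-link flow can be summarised with a toy linker in Python. This is a deliberately simplified sketch, not a description of how ld works: it assumes each object file has a single code section, handles only 32-bit PC-relative relocations, ignores alignment, and uses an arbitrary base address of 0x1000. The ObjectFile class and link function are hypothetical names, and message.o is replaced by a one-byte stand-in (a lone 'ret').

```python
import struct
from dataclasses import dataclass, field

@dataclass
class ObjectFile:
    name: str
    code: bytes                                    # contents of .text
    defined: dict = field(default_factory=dict)    # symbol -> offset in code
    relocs: list = field(default_factory=list)     # (offset, symbol, addend)

def link(objects, base=0x1000):
    # Pass 1: place the sections back to back and record symbol addresses.
    image = bytearray()
    section_start = {}
    symbols = {}
    for obj in objects:
        section_start[obj.name] = base + len(image)
        for sym, off in obj.defined.items():
            if sym in symbols:
                raise ValueError(f"duplicate symbol: {sym}")
            symbols[sym] = section_start[obj.name] + off
        image += obj.code
    # Pass 2: patch each relocation with symbol + addend - place.
    for obj in objects:
        for off, sym, addend in obj.relocs:
            if sym not in symbols:
                raise ValueError(f"undefined reference to {sym}")
            place = section_start[obj.name] + off
            pos = place - base
            image[pos:pos + 4] = struct.pack("<i", symbols[sym] + addend - place)
    return bytes(image), symbols

main_o = ObjectFile(
    name="main.o",
    code=bytes.fromhex("e800000000" "48c7c001000000" "cd80"),  # bytes from objdump -d main.o
    defined={"_start": 0},
    relocs=[(1, "do_message", -4)],
)
# Stand-in for message.o: do_message is just a single 'ret'.
message_o = ObjectFile(name="message.o", code=b"\xc3", defined={"do_message": 0})

image, symbols = link([main_o, message_o])
print(image.hex(" "))    # e8 09 00 00 00 48 c7 c0 01 00 00 00 cd 80 c3
print(hex(symbols["do_message"]))   # 0x100e
```

Calling link([main_o]) on its own raises "undefined reference to do_message", which mirrors what happens when a real link leaves a relocation unresolved.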