How does assembler compute segment and offset for symbol addresses?

I have learned about compilers and assembly language, so I'd like to write my own assembler as an exercise. But I have some questions:

How can I compute the address for segments such as @DATA, or for expressions like OFFSET/ADDR VarA?

Take an easy assembly program as an example:

    .model small
    .stack 1024
    .data
          msg db 128 dup('A')
    .code
    start:
        mov ax,@data
        mov ds,ax
        mov dx, offset msg
                           ; DS:DX points at msg
        mov ah,4ch
        int 21h            ; exit program without using msg
    end start

So how does the assembler calculate the segment address for the @data segment?

And how does it know what to put into the immediate for mov dx, offset msg?

asked Apr 20 '15 13:04

user152531

2 Answers

The assembler doesn't know where @data and msg will end up in memory, so it generates metadata called relocations (or "fixups") in the object (.OBJ) file that allow the linker and the operating system to fill in the correct values.

Let's take a look at what happens with a slightly different example program:

.model small
.stack 1024
.data
    msg db 'Hello, World!','$'
.code
start:
    mov ax,SEG msg
    mov ds,ax
    mov dx,OFFSET msg
    mov ah,09h
    int 21h              ; write string in DS:DX to stdout
    mov ah,4ch
    int 21h              ; exit(AL)
end start

When assembling this file, the assembler has no way of knowing where the linker will put anything defined by this example program. It may appear obvious to you, but the assembler can't assume it is seeing a complete program. The assembler doesn't know if you'll link it with other object files or libraries, which could cause the linker to put msg somewhere other than the start of the data segment.

So when this example program gets assembled into an object file, the assembler generates two relocation records. If you use MASM to assemble the file, you can see this in the listing file generated with the /Fl switch:

 ; listing of the .obj assembler output, before linking
 0000               start:
 0000  B8 ---- R            mov ax,SEG msg
 0003  8E D8                mov ds,ax
 0005  BA 0000 R            mov dx,OFFSET msg
 0008  B4 09                mov ah,09h

The R next to the operand in the machine code column of the listing indicates that these operands have relocations that refer to them. When the linker creates the MS-DOS format executable from the object file, it will be able to supply the correct offset of msg from the start of the data segment. That value is a link-time constant, so only the .obj, not the .exe, needs a relocation for it.
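
To make the listing above concrete, here is a minimal sketch (in Python, since the assembler itself could be written in any language) of how an assembler might emit those two instructions with placeholder operands while recording fixups. The Fixup class, its field names and the "SEG"/"OFFSET" kinds are hypothetical, not the real OMF record layout; actual .obj files encode this information in FIXUPP records with more detail.

```python
from dataclasses import dataclass

@dataclass
class Fixup:
    offset: int   # position of the 16-bit operand inside the code section
    kind: str     # "SEG" or "OFFSET" (hypothetical names for this sketch)
    symbol: str   # symbol the operand refers to

def emit_mov_imm16(code: bytearray, fixups: list, opcode: int,
                   kind: str, symbol: str) -> None:
    """Emit 'mov r16, imm16' with a zero placeholder and record a fixup."""
    code.append(opcode)
    fixups.append(Fixup(len(code), kind, symbol))
    code += b"\x00\x00"   # placeholder the linker or loader patches later

code = bytearray()
fixups = []
emit_mov_imm16(code, fixups, 0xB8, "SEG", "msg")      # mov ax,SEG msg
code += bytes([0x8E, 0xD8])                           # mov ds,ax
emit_mov_imm16(code, fixups, 0xBA, "OFFSET", "msg")   # mov dx,OFFSET msg
```

The two entries in fixups correspond to the two operands marked R in the listing: one at offset 1 (the segment) and one at offset 6 (the offset).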

However, the linker won't be able to supply the location of the segment of msg (the data segment) because the linker doesn't know where MS-DOS will load the executable into memory. (Unlike under a modern mainstream OS where every process has its own virtual address space, real mode has only one address space that programs have to share with device drivers, TSRs, and the OS itself.)

So the linker will put a relocation in the generated executable that tells MS-DOS to adjust the immediate operand based on where it gets loaded.
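
As a rough illustration of that load-time step, the sketch below patches segment words the way a DOS-style loader would: each relocation entry names a 16-bit word in the loaded image, and the loader adds the segment the image was loaded at. The function name and the sample image are made up for illustration, and parsing the actual .EXE header is left out.

```python
import struct

def apply_segment_relocations(image: bytearray, relocs, load_segment: int) -> None:
    """Add load_segment to every 16-bit word named by a relocation entry.

    relocs is a list of (segment, offset) pairs relative to the image start.
    """
    for seg, off in relocs:
        pos = seg * 16 + off                       # linear position in image
        (value,) = struct.unpack_from("<H", image, pos)
        struct.pack_into("<H", image, pos, (value + load_segment) & 0xFFFF)

# 'mov ax,0002h / mov ds,ax': in this made-up image the linker placed the
# data segment 2 paragraphs after the start of the image.
image = bytearray([0xB8, 0x02, 0x00, 0x8E, 0xD8])
apply_segment_relocations(image, [(0, 1)], load_segment=0x1234)
```

After the call, the immediate operand holds 1236h, the real segment of the data once the image sits at segment 1234h.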

Note that you might want to simplify your assembler-writing exercise by writing one that only works with complete programs and generates only .COM executables. That way you don't have to worry about relocations. Your assembler will decide where everything gets placed within the single segment allowed by the .COM format. Note that because .COM files don't support segment relocations, instructions like mov ax,@data or mov ax,SEG msg can't be used. Instead, CS=DS=ES=SS on program startup, with a value chosen by the OS's program loader. (And that value isn't known at assemble time.)
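
A minimal sketch of the .COM approach, assuming a hard-coded instruction sequence rather than a real parser: because DOS loads a .COM image at offset 100h of its single segment, the offset of msg is just 100h plus its position in the file, so no fixups are needed. The function name assemble_com is hypothetical.

```python
ORG = 0x100  # DOS loads a .COM image at offset 100h of its single segment

def assemble_com(message: bytes) -> bytes:
    """Build a tiny .COM image that prints message (must end in '$') and exits."""
    # mov dx,imm16 (3) + mov ah,imm8 (2) + int (2) + mov ah,imm8 (2) + int (2)
    code_len = 3 + 2 + 2 + 2 + 2
    msg_offset = ORG + code_len          # msg is placed right after the code
    code = bytearray()
    code += bytes([0xBA]) + msg_offset.to_bytes(2, "little")  # mov dx,OFFSET msg
    code += bytes([0xB4, 0x09, 0xCD, 0x21])                   # mov ah,09h / int 21h
    code += bytes([0xB4, 0x4C, 0xCD, 0x21])                   # mov ah,4ch / int 21h
    assert len(code) == code_len
    return bytes(code) + message

image = assemble_com(b"Hello, World!$")
```

Here the operand of mov dx ends up as 010Bh, known entirely at assemble time. Writing image to a file with a .COM extension should give a program that a DOS environment can run.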

answered Oct 04 '22 10:10

Ross Ridge

How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?

There are 2 cases:

a) the assembler is generating a flat binary or executable file itself, and no linker is involved

b) the assembler is generating an object file to be sent to a linker later

Note that you can have a mixture. For example, in some assemblers (e.g. NASM) there are keywords to create a temporary section (e.g. absolute), and structures are supported by internally using a temporary section (a field in a structure is an offset into a temporary section that begins at address zero).

In both cases, the assembler converts the source code into some kind of internal representation (e.g. maybe an "instruction data, operand 1 data, operand 2 data, ..." thing), where the internal representation for instructions like "jmp foo" and "mov eax,bar/5+33" can't be simplified too much and needs to include some reference to a symbol in the symbol table.

For the symbol table itself, each entry has a symbol name (e.g. "foo"), which section it is in, the lowest possible offset within the section and the highest possible offset within the section. When the lowest possible offset and highest possible offset match, and the section has a known address, the assembler can replace references to that symbol in the internal representation with an actual value.
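
A symbol table entry along those lines could look like the following sketch. The Symbol class, its field names and the resolve helper are illustrative assumptions, not taken from any particular assembler.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Symbol:
    name: str
    section: str
    min_offset: int   # lowest possible offset within the section
    max_offset: int   # highest possible offset within the section

def resolve(sym: Symbol, section_addresses: dict) -> Optional[int]:
    """Return the symbol's address, or None if it is not fully known yet."""
    base = section_addresses.get(sym.section)
    if base is None or sym.min_offset != sym.max_offset:
        return None
    return base + sym.min_offset

known = resolve(Symbol("foo", ".text", 4, 4), {".text": 0x100})
unknown = resolve(Symbol("bar", ".text", 4, 7), {".text": 0x100})
```

known is 104h because both the offset and the section address are pinned down; unknown is None because the offset of bar is still somewhere between 4 and 7.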

Note that there are cases where you can't know how large an instruction will be until later (e.g. for 80x86, "jmp foo" could be a 2 byte instruction if the target address is close, but may need to be a 3 byte instruction or 5 byte instruction if the target address isn't close, and you can't decide until you know something about the value that "foo" will have); and when you can't know how large an instruction will be, you can't know the offset of any symbols that occur later in the same section. This is why you end up wanting symbols to have both a lowest possible offset and a highest possible offset, so that even when you don't know the actual offset of a symbol you can still know whether the offset will be small enough or too large, and can still determine how big an instruction will be (and get a better idea of the values of later symbols in that section).

More specifically; while assembling you want to do multiple passes, where each pass tries to convert the intermediate representations of each instruction into more specific/complete versions and tries to improve the lowest possible offset and highest possible offset values for symbols (so that you have more/better information that the next pass can use).
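
Here is a sketch of the multiple-pass idea, using a simpler variant than full min/max range tracking: every jump optimistically starts as the 2 byte short form and is grown to the 3 byte near form (16-bit code assumed) whenever a pass finds its target out of range, until a pass changes nothing. The item tuples and the relax function are made up for this example.

```python
def relax(items):
    """items: list of ("label", name), ("jmp", target) or ("bytes", count)."""
    sizes = [2 if kind == "jmp" else (arg if kind == "bytes" else 0)
             for kind, arg in items]
    passes = 0
    while True:
        passes += 1
        # Recompute where every item and label lands with the current sizes.
        offsets, starts, pos = {}, [], 0
        for (kind, arg), size in zip(items, sizes):
            starts.append(pos)
            if kind == "label":
                offsets[arg] = pos
            pos += size
        # Grow any short jump whose displacement no longer fits in rel8.
        changed = False
        for i, (kind, arg) in enumerate(items):
            if kind == "jmp" and sizes[i] == 2:
                disp = offsets[arg] - (starts[i] + 2)  # relative to next insn
                if not -128 <= disp <= 127:
                    sizes[i] = 3
                    changed = True
        if not changed:
            return sizes, offsets, passes

items = [("jmp", "a"), ("jmp", "b"), ("bytes", 125),
         ("label", "a"), ("bytes", 200), ("label", "b")]
sizes, offsets, passes = relax(items)
```

In this input the first jump only just fits in the first pass, but growing the second jump pushes label a one byte further away, so the first jump has to grow in the next pass. That knock-on effect is why a single pass is not enough.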

When you have finished doing the "multiple passes" and the assembler is generating a flat binary with no linker involved, everything will be known (including the address of every section and the offset of every symbol within its section), all instructions will have been converted into actual bytes, and you can generate the final file.

When you have finished doing the "multiple passes" and the assembler is generating an object file; some things will not be known (the address of sections) and some things will be known (the offset of all symbols within sections, the size of all instructions); and the object file format will provide a way for you to provide details of things you don't/can't know (e.g. a list of things that need fixing, and information the linker can use to fix them) that you can provide from what's left of the intermediate representation of instructions and the symbol table.

Note that there can be cases that are too complex for an object file format to support (e.g. probably the "mov eax,bar/5+33" from earlier), where an instruction that can be assembled without any problem (if the assembler is generating a flat binary) has to be treated as an error (if the assembler is generating an object file). You will discover these cases (and generate appropriate error messages) when trying to create the object file.
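
One way a back-end might detect such cases is sketched below, assuming (hypothetically) that operand expressions are kept as nested tuples and that the object format can only express "symbol plus constant". Both the tuple layout and the UnsupportedRelocation name are inventions for this example.

```python
class UnsupportedRelocation(Exception):
    """Raised when an expression can't be expressed as symbol + addend."""

def as_symbol_plus_addend(expr):
    """Reduce expr to (symbol, addend) or raise UnsupportedRelocation.

    expr is ("sym", name), ("const", n) or (op, left, right).
    """
    kind = expr[0]
    if kind == "sym":
        return expr[1], 0
    if kind in ("+", "-"):
        _, left, right = expr
        if right[0] == "const":
            sym, addend = as_symbol_plus_addend(left)
            delta = right[1] if kind == "+" else -right[1]
            return sym, addend + delta
        if kind == "+" and left[0] == "const":
            sym, addend = as_symbol_plus_addend(right)
            return sym, addend + left[1]
    raise UnsupportedRelocation(f"cannot relocate expression: {expr!r}")

ok = as_symbol_plus_addend(("+", ("sym", "bar"), ("const", 33)))   # bar+33
try:
    as_symbol_plus_addend(("+", ("/", ("sym", "bar"), ("const", 5)),
                           ("const", 33)))                          # bar/5+33
    rejected = False
except UnsupportedRelocation:
    rejected = True
```

bar+33 reduces to a symbol and an addend the linker can handle, while bar/5+33 is rejected, which is the point where the assembler would report an error for object file output.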

Note that this all fits into a nice "3 phases" arrangement, where the "front-end" converts the "plain text" input into the intermediate representation, the "middle-end" (the multiple passes) refines the intermediate representation as much as possible, and the "back-end" generates a file. Only the back-end needs to care what the target file format is.

answered Oct 04 '22 09:10

Brendan

Tags:

assembly

x86-16

compiler-construction

memory-segmentation

masm
