How does assembler compute segment and offset for symbol addresses?

I have learned about compilers and assembly language, so I'd like to write my own assembler as an exercise. But I have some questions:

How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?

Take an easy assembly program as an example:

    .model small
    .stack 1024
    .data
          msg db 128 dup('A')
    .code
    start:
        mov ax,@data
        mov ds,ax
        mov dx, offset msg
                           ; DS:DX points at msg
        mov ah,4ch
        int 21h            ; exit program without using msg
    end

So how does the assembler calculate the segment address for the @data segment?

And how does it know what to put into the immediate for mov dx, offset msg?

user152531 asked Apr 20 '15 13:04


2 Answers

The assembler doesn't know where @data and msg will end up in memory, so it generates metadata called relocations (or "fixups") in the object (.OBJ) file that allow the linker and the operating system to fill in the correct values.

Let's take a look at what happens with a slightly different example program:

.model small
.stack 1024
.data
    msg db 'Hello, World!','$'
.code
start:
    mov ax,SEG msg
    mov ds,ax
    mov dx,OFFSET msg
    mov ah,09h
    int 21h              ; write string in DS:DX to stdout
    mov ah,4ch
    int 21h              ; exit(AL)
end start

When assembling this file, the assembler has no way of knowing where the linker will put anything defined by this example program. It may appear obvious to you, but the assembler can't assume it is seeing a complete program. It doesn't know whether you'll link it with other object files or libraries, which could cause the linker to put msg somewhere other than the start of the data segment.

So when this example program gets assembled into an object file, the assembler generates two relocation records. If you use MASM to assemble the file, you can see this in the listing file generated with the /Fl switch:

 ; listing of the .obj assembler output, before linking
 0000               start:
 0000  B8 ---- R            mov ax,SEG msg
 0003  8E D8                mov ds,ax
 0005  BA 0000 R            mov dx,OFFSET msg
 0008  B4 09                mov ah,09h

The R next to the operand in the machine code column of the listing indicates that these operands have relocations that refer to them. When the linker creates the MS-DOS format executable from the object file, it will be able to supply the correct offset of msg from the start of the data segment. That value is a link-time constant, so only the .obj, not the .exe, needs a relocation for it.
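
As a rough illustration of that split, here is a minimal Python sketch of an assembler emitting the two instructions with placeholder operands plus fixup records, followed by a linker-like step that resolves only the OFFSET one. The Fixup record, the function names, and the value 0x0010 for msg are assumptions made for this example; real OMF fixup records carry more information than this.

```python
import struct
from dataclasses import dataclass

@dataclass
class Fixup:
    position: int  # byte offset of the 16-bit operand inside the code section
    symbol: str    # symbol whose value is still missing
    kind: str      # "offset" (link-time constant) or "segment" (load-time)

def emit_mov_imm16(code, fixups, opcode, symbol, kind):
    """Emit 'mov reg16, imm16' with a zero placeholder and record a fixup."""
    code.append(opcode)
    fixups.append(Fixup(len(code), symbol, kind))
    code += b"\x00\x00"

def link_offsets(code, fixups, symbol_offsets):
    """Patch every 'offset' fixup; return the fixups the linker cannot resolve."""
    remaining = []
    for fix in fixups:
        if fix.kind == "offset":
            (addend,) = struct.unpack_from("<H", code, fix.position)
            value = (addend + symbol_offsets[fix.symbol]) & 0xFFFF
            struct.pack_into("<H", code, fix.position, value)
        else:
            remaining.append(fix)  # must survive into the .exe for the loader
    return remaining

code, fixups = bytearray(), []
emit_mov_imm16(code, fixups, 0xB8, "msg", "segment")  # mov ax,SEG msg
code += bytes([0x8E, 0xD8])                           # mov ds,ax
emit_mov_imm16(code, fixups, 0xBA, "msg", "offset")   # mov dx,OFFSET msg

# Assumed for the example: another module's data pushed msg to offset 0x0010.
remaining = link_offsets(code, fixups, {"msg": 0x0010})
```

After the linker-like step, the operand of mov dx holds 0x0010, while the fixup for SEG msg is still pending.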

However, the linker won't be able to supply the segment of msg (the data segment), because the linker doesn't know where MS-DOS will load the executable into memory. (Unlike under a modern mainstream OS, where every process has its own virtual address space, real mode has only one address space, which programs have to share with device drivers, TSRs, and the OS itself.)

So the linker will put a relocation in the generated executable that tells MS-DOS to adjust the immediate operand based on where it gets loaded.
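
A simplified model of what the loader does with those entries, as a Python sketch. It assumes each relocation entry is an (offset, segment) pair locating a 16-bit word relative to the start of the loaded image, and that the linker stored a segment value relative to the image start in that word. The function name and the sample values are made up for illustration.

```python
import struct

def apply_segment_relocations(image, relocations, start_segment):
    """Add the load segment to every word named in the relocation table."""
    for offset, segment in relocations:
        position = segment * 16 + offset  # real-mode address within the image
        (value,) = struct.unpack_from("<H", image, position)
        struct.pack_into("<H", image, position, (value + start_segment) & 0xFFFF)

# mov ax,0001h / mov ds,ax, padded to 32 bytes. The linker placed the data
# segment one paragraph (16 bytes) after the start of the image.
image = bytearray([0xB8, 0x01, 0x00, 0x8E, 0xD8]) + bytearray(27)
relocations = [(0x0001, 0x0000)]  # the operand of 'mov ax' at 0000:0001

# Assumed for the example: the loader placed the image at segment 1234h.
apply_segment_relocations(image, relocations, 0x1234)
```

The operand of mov ax becomes 1235h, the actual run-time segment of the data.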


Note that you might want to simplify your assembler-writing exercise by writing one that only works with complete programs and generates only .COM executables. That way you don't have to worry about relocations. Your assembler will decide where everything gets placed within the single segment allowed by the .COM format. Because .COM files don't support segment relocations, instructions like mov ax,@data or mov ax,SEG msg can't be used. Instead, CS=DS=ES=SS on program startup, with a value chosen by the OS's program loader. (And that value isn't known at assemble time.)
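
To show how little is left to compute in the .COM case, here is a small Python sketch that lays out a fixed code sequence followed by the message and computes the offset of the message directly. It relies on the fact that a .COM image is loaded at offset 100h in its segment; the function name build_com and the hard-coded instruction bytes are choices made for this example.

```python
COM_ORIGIN = 0x100  # a .COM image starts at offset 100h within its segment

def build_com(message):
    """Build a .COM image that prints a '$'-terminated message and exits."""
    tail = bytes([
        0xB4, 0x09,  # mov ah,09h
        0xCD, 0x21,  # int 21h   ; write string at DS:DX
        0xB4, 0x4C,  # mov ah,4ch
        0xCD, 0x21,  # int 21h   ; exit
    ])
    code_size = 3 + len(tail)           # 3 bytes for 'mov dx,imm16'
    msg_addr = COM_ORIGIN + code_size   # message sits right after the code
    mov_dx = bytes([0xBA, msg_addr & 0xFF, msg_addr >> 8])
    return mov_dx + tail + message

com_image = build_com(b"Hello, World!$")
```

No relocation is needed: the offset of the message is known as soon as the size of the code is known, and DS already equals CS when the program starts.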

Ross Ridge answered Oct 04 '22 10:10


How can I compute the address for segments such as @DATA or like OFFSET/ADDR VarA?

There are 2 cases:

a) the assembler is generating a flat binary or executable file itself, and no linker is involved

b) the assembler is generating an object file to be sent to a linker later

Note that you can have a mixture. For example, in some assemblers (e.g. NASM) there are keywords to create a temporary section (e.g. absolute), and structures are supported by internally using a temporary section (a field in a structure is an offset into a temporary section that begins at address zero).

For both cases, the assembler converts the source code into some kind of internal representation (e.g. maybe an "instruction data, operand 1 data, operand 2 data, ..." thing), where the internal representation for instructions like "jmp foo" and "mov eax,bar/5+33" can't be simplified too much and needs to include some reference to a symbol in the symbol table.

For the symbol table itself, each entry has a symbol name (e.g. "foo"), which section it is in, the lowest possible offset within the section and the highest possible offset within the section. When the lowest possible offset and highest possible offset match, and the section has a known address, the assembler can replace references to that symbol in the internal representation with an actual value.
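
One possible shape for such a symbol table entry, as a Python sketch. The field names, the resolve helper, and the sample sections are assumptions for illustration, not taken from any particular assembler.

```python
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Symbol:
    name: str
    section: str
    min_offset: int  # lowest possible offset within the section
    max_offset: int  # highest possible offset within the section

def resolve(symbol: Symbol, section_bases: Dict[str, Optional[int]]) -> Optional[int]:
    """Return the symbol's address, or None if it cannot be known yet."""
    base = section_bases.get(symbol.section)
    if base is None or symbol.min_offset != symbol.max_offset:
        return None
    return base + symbol.min_offset

# .text has a known address; .data does not (a linker would decide it later).
bases = {".text": 0x100, ".data": None}
foo = Symbol("foo", ".text", 0x10, 0x10)  # offset pinned down
bar = Symbol("bar", ".text", 0x20, 0x23)  # offset still uncertain
msg = Symbol("msg", ".data", 0x00, 0x00)  # exact offset, unknown section base
```

Only foo can be replaced by an actual value here; bar needs more passes, and msg needs either a known section address or a fixup.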

Note that there are cases where you can't know how large an instruction will be until later (e.g. for 80x86, "jmp foo" could be a 2 byte instruction if the target address is close, but may need to be a 3 byte instruction or 5 byte instruction if the target address isn't close, and you can't decide until you know something about the value that "foo" will have); and when you can't know how large an instruction will be, you can't know the offset of any symbols that occur later in the same section. This is why you end up wanting symbols to have both a lowest possible offset and a highest possible offset: even when you don't know the actual offset of a symbol, you can still know that the offset will be small enough or too large, and can still determine how big an instruction will be (and get a better idea of the values of later symbols in that section).
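
The size decision can be sketched in Python as follows. This assumes 16-bit code, where the short form of jmp is 2 bytes with a signed 8-bit displacement and the near form is 3 bytes, with the displacement measured from the end of the instruction. The function name and the (min, max) tuple convention are made up for the example.

```python
SHORT_JMP = 2  # EB rel8
NEAR_JMP = 3   # E9 rel16 (16-bit code assumed)

def jmp_size_range(ins, target):
    """ins and target are (lowest, highest) possible offsets.

    Returns the (smallest, largest) size the jmp might still have.
    """
    # Range of displacements the short form would need, in the worst cases.
    disp_min = target[0] - (ins[1] + SHORT_JMP)
    disp_max = target[1] - (ins[0] + SHORT_JMP)
    if -128 <= disp_min and disp_max <= 127:
        return (SHORT_JMP, SHORT_JMP)  # fits wherever things end up
    if disp_min > 127 or disp_max < -128:
        return (NEAR_JMP, NEAR_JMP)    # cannot fit in any layout
    return (SHORT_JMP, NEAR_JMP)       # undecided, needs a later pass

close = jmp_size_range((0, 0), (10, 12))
far = jmp_size_range((0, 0), (500, 600))
unsure = jmp_size_range((0, 0), (100, 200))
```

Treating the two ranges as independent is conservative, since offsets in the same section tend to shift together, but it never produces a wrong answer, only an "undecided" one.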

More specifically; while assembling you want to do multiple passes, where each pass tries to convert the intermediate representations of each instruction into more specific/complete versions and tries to improve the lowest possible offset and highest possible offset values for symbols (so that you have more/better information that the next pass can use).
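
A minimal Python sketch of such a pass loop, under several assumptions: the program is a list of ("label", name), ("bytes", count) and ("jmp", target) items, jumps are 2 or 3 bytes as in 16-bit code, and any jump still undecided once the passes stop changing anything is forced to the long form. That last step is a choice made for this sketch, not something the description above prescribes.

```python
def jmp_size_range(ins, target):
    """(smallest, largest) possible size of a jmp, from (min, max) offsets."""
    disp_min = target[0] - (ins[1] + 2)
    disp_max = target[1] - (ins[0] + 2)
    if -128 <= disp_min and disp_max <= 127:
        return (2, 2)
    if disp_min > 127 or disp_max < -128:
        return (3, 3)
    return (2, 3)

def layout(items, sizes):
    """Compute (min, max) start offsets of every item and every label."""
    lo = hi = 0
    starts, labels = [], {}
    for (kind, arg), (smin, smax) in zip(items, sizes):
        starts.append((lo, hi))
        if kind == "label":
            labels[arg] = (lo, hi)
        lo, hi = lo + smin, hi + smax
    return starts, labels

def relax(items):
    sizes = []
    for kind, arg in items:
        if kind == "jmp":
            sizes.append((2, 3))      # nothing known yet
        elif kind == "bytes":
            sizes.append((arg, arg))  # fixed-size data or instructions
        else:
            sizes.append((0, 0))      # labels occupy no space
    while True:                       # one iteration per pass
        starts, labels = layout(items, sizes)
        refined = [
            jmp_size_range(starts[i], labels[arg]) if kind == "jmp" else sizes[i]
            for i, (kind, arg) in enumerate(items)
        ]
        if refined == sizes:
            break                     # no pass can improve anything further
        sizes = refined
    sizes = [(hi, hi) for _, hi in sizes]  # undecided jumps take the long form
    _, labels = layout(items, sizes)
    return {name: lo for name, (lo, _) in labels.items()}, [hi for _, hi in sizes]

program = [
    ("jmp", "far_away"),
    ("jmp", "near_by"),
    ("label", "near_by"),
    ("bytes", 200),
    ("label", "far_away"),
]
offsets, final_sizes = relax(program)
```

Here the first pass already shows that the jump to far_away cannot be short and the jump to near_by can, and the second pass confirms that nothing changes once the label offsets are exact.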

When you have finished doing the "multiple passes" and the assembler is generating a flat binary with no linker involved, everything will be known (including the address of sections and the offset of all symbols within sections), all instructions will have been converted into actual bytes, and you can generate the final file.

When you have finished doing the "multiple passes" and the assembler is generating an object file; some things will not be known (the address of sections) and some things will be known (the offset of all symbols within sections, the size of all instructions); and the object file format will provide a way for you to provide details of things you don't/can't know (e.g. a list of things that need fixing, and information the linker can use to fix them) that you can provide from what's left of the intermediate representation of instructions and the symbol table.

Note that there can be cases that are too complex for an object file format to support (e.g. probably the "mov eax,bar/5+33" from earlier), where an instruction that can be assembled without any problem (if the assembler is generating a flat binary) has to be treated as an error (if the assembler is generating an object file). You will discover these cases (and generate appropriate error messages) when trying to create the object file.
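
One way to detect such cases is to reduce each operand expression to the form "symbol + constant" and report an error when that isn't possible. The Python sketch below does this under some loud assumptions: it borrows Python's own ast module as a stand-in for the assembler's expression parser, it assumes the object format can only express "one symbol plus a constant", and the names linear_form and NotRelocatable are invented for the example.

```python
import ast

class NotRelocatable(Exception):
    """The expression cannot be expressed as 'symbol + constant'."""

def linear_form(expr, known):
    """Reduce expr to (symbol_or_None, constant).

    'known' maps symbols whose values are already fixed to those values.
    """
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, int):
            return (None, node.value)
        if isinstance(node, ast.Name):
            if node.id in known:
                return (None, known[node.id])
            return (node.id, 0)  # left for the linker to fill in
        if isinstance(node, ast.BinOp):
            (lsym, lconst), (rsym, rconst) = walk(node.left), walk(node.right)
            if isinstance(node.op, ast.Add) and not (lsym and rsym):
                return (lsym or rsym, lconst + rconst)
            if isinstance(node.op, ast.Sub) and rsym is None:
                return (lsym, lconst - rconst)
            if lsym is None and rsym is None:
                if isinstance(node.op, ast.Mult):
                    return (None, lconst * rconst)
                if isinstance(node.op, ast.Div) and rconst != 0:
                    return (None, lconst // rconst)  # integer division
        raise NotRelocatable(expr)
    return walk(ast.parse(expr, mode="eval").body)

# Flat binary: bar is known (100 is an assumed value), so anything goes.
flat = linear_form("bar/5+33", {"bar": 100})

# Object file: bar is unknown; bar+33 can become a fixup, bar/5+33 cannot.
simple = linear_form("bar+33", {})
try:
    linear_form("bar/5+33", {})
    too_complex = False
except NotRelocatable:
    too_complex = True
```

The same expression is fine in one back-end and an error in the other, which is why the check belongs where the object file is generated.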

Note that this all fits into a nice "3 phases" arrangement, where the "front-end" converts the "plain text" input into the intermediate representation, the "middle-end" (the multiple passes) refines the intermediate representation as much as possible, and the "back-end" generates a file. Only the back-end needs to care what the target file format is.

Brendan answered Oct 04 '22 09:10