
What is the function of a "data label" in an x86 assembler?

Tags: x86, assembly, masm

I'm currently learning assembly programming by following Kip Irvine's book "Assembly Language for x86 Processors".

In the book, the author explains the concept of a data label as follows:

A data label identifies the location of a variable, providing a convenient way to reference the variable in code. The following, for example, defines a variable named count:

count DWORD 100

The assembler assigns a numeric address to each label.

So my understanding of what a data label does is this: the data label count is a variable that contains a numeric value, where the numeric value is a location in memory. When I use count in my code, I'm actually using the value stored at that location in memory, in this instance 100.

Is my understanding of data labels correct? If it is somewhat incorrect, could someone please point out the mistake?

Asked by Thor on Jun 25 '17 at 03:06


1 Answer

Labels are a symbolic way to write memory addresses, nothing more, nothing less. A label itself takes no space, and is just a handy way to let you refer to that spot in memory later.

(Well, they can also turn into symbols in an object file to allow numeric addresses to be calculated at link time, instead of at assemble time. But for labels defined and referenced in the same file, this extra complexity is mostly invisible; see below about addresses being link-time constants, not assemble-time.)

e.g.

; NASM syntax, but the concepts apply exactly to MASM as well
; For MASM, you may need  BYTE PTR or whatever size overrides in loads.
section .rodata     ; or section .data  if you want to be able to store here, too.
COUNT:
   db 0x12
FOO:
   db 0
BAR:
   dw 0x80FF      ; same as   db 0xff, 0x80

A 4-byte load like mov eax, [COUNT] will get 0x80FF0012 (since x86 is little-endian). A 2-byte load from FOO like mov cx, [FOO] will get 0xFF00.

You might actually use overlapping loads from a constant this way, e.g. with strings where some are substrings of others. For null-terminated strings, only common suffixes can be combined into the same storage space this way.
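As a minimal sketch of suffix sharing, assuming NASM on x86-64 Linux (assembled with nasm -f elf64 and linked with ld): the label names greeting, name_only, and greeting_end are made up for illustration. Two labels point into one run of bytes, so the shorter string costs no extra storage. Explicit lengths are used here because write() takes a length; with a terminating 0 byte the same two labels would work as C strings.

```nasm
section .rodata
greeting:                       ; start of the full string
    db "Hello, "
name_only:                      ; hypothetical label in the middle of the same bytes
    db "world", 10
greeting_end:                   ; labels take no space; this one just marks the end

section .text
global _start
_start:
    ; write(1, name_only, 6) prints "world" and a newline
    mov   eax, 1                          ; __NR_write
    mov   edi, 1                          ; fd 1 = stdout
    lea   rsi, [rel name_only]            ; address of the suffix
    mov   edx, greeting_end - name_only   ; assemble-time constant: 6
    syscall

    ; write(1, greeting, 13) prints "Hello, world" and a newline
    mov   eax, 1
    mov   edi, 1
    lea   rsi, [rel greeting]
    mov   edx, greeting_end - greeting    ; assemble-time constant: 13
    syscall

    mov   eax, 60                         ; __NR_exit
    xor   edi, edi                        ; status 0
    syscall
```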


Now does this mean that COUNT is a 4-byte variable or a 1-byte variable? No, neither. Assembly language doesn't really have "variables".

Variables are a higher-level concept that you can implement in assembly language with a label and an assembler directive that reserves some static space. Notice that the labels are separate from the db directives in the example above.

But a variable doesn't need to have any static storage space: e.g. your loop counter variable can (and often should) exist only in a register.

A variable doesn't even need to have a single fixed location. It can be spilled to the stack in a part of a function where it's not used, but live in a register in another part of the function. In compiler-generated code, variables often move between registers for no obvious reason, because compilers make no particular effort to keep the same variable in the same register.
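A minimal sketch of a program whose "variables" have no label and no static storage at all, assuming NASM on x86-64 Linux (nasm -f elf64, then ld). The choice of EAX for the sum and ECX for the counter is arbitrary.

```nasm
section .text
global _start
_start:
    xor   eax, eax          ; running sum lives only in EAX
    mov   ecx, 10           ; loop counter lives only in ECX
.sum_loop:
    add   eax, ecx          ; sum += counter
    dec   ecx               ; counter--
    jnz   .sum_loop         ; repeat while counter != 0

    mov   edi, eax          ; exit status = 10 + 9 + ... + 1 = 55
    mov   eax, 60           ; __NR_exit
    syscall
```

Running it and then checking the shell's exit status (echo $? in a POSIX shell) should show 55, even though the program has no data section.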


Note that MASM does implicitly associate a label with an operand-size based on the directive that follows it. So you might have to write mov eax, dword ptr [count] if mov eax, [count] gives an operand-size mismatch error.

Some people consider this a feature, but others think this magic operand-size stuff is totally weird. NASM syntax doesn't have any of this magic. You can tell how a line will assemble without having to go and find where the labels are defined. add [count], 1 is an error in NASM, because nothing implies an operand-size.
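To illustrate the NASM side of this, here is a minimal sketch assuming NASM on x86-64 Linux (nasm -f elf64, then ld). The operand-size comes from a register operand when there is one, and must be written explicitly otherwise; the dd after the label reserves 4 bytes but attaches no size to the label itself.

```nasm
section .data
count:  dd 100                      ; 4 bytes of storage; the label carries no size

section .text
global _start
_start:
    mov   eax, [rel count]          ; size implied by EAX: a 4-byte load
    add   dword [rel count], 1      ; no register operand, so the size must be stated
    ; add [rel count], 1            ; rejected by NASM: operation size not specified
    movzx edi, byte [rel count]     ; a 1-byte load from the same address: 101
    mov   eax, 60                   ; __NR_exit
    syscall                         ; exit status 101
```

The last load reads only the low byte of the 4-byte value, which works because x86 is little-endian and 101 fits in one byte.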

Don't get stuck thinking that everything you'd use a variable for in C must have static storage with a label in your assembly-language programs. But if you do want to use the term "variable" for static data storage plus a label, like Kip Irvine does, then go ahead.


Also note that data labels are not special or different from code labels. Nothing stops you from writing jmp COUNT. Decoding 12 00 FF 80 as a (sequence of) x86 instruction(s) is left as an exercise for the reader, but (if it's in a page with execute permission), it will be fetched and decoded by the CPU.

Similarly, nothing stops you from loading data from code labels as a memory operand. It's not usually a good idea for performance reasons to mix code and data (all modern x86 CPUs use split L1D and L1I caches), but that works too. In a typical OS (like Linux), the text segment of an executable contains the code and read-only data sections, and is mapped with read and execute permission. (But not write permission, so trying to store there will fault unless you have changed the permissions.)
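A minimal sketch of a data load from a code label, assuming NASM on x86-64 Linux (nasm -f elf64, then ld) and a text segment mapped readable, as described above. The label name target is made up for illustration.

```nasm
section .text
global _start
_start:
    movzx edi, byte [rel target]    ; read the first machine-code byte at target as data
    mov   eax, 60                   ; __NR_exit
    syscall                         ; exit status 195 (0xC3, the one-byte encoding of ret)
target:
    ret                             ; never executed here, only read
```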

A JIT-compiler writes machine code to a buffer and then jumps there. It could be a static buffer with a label, but more usually it would be a dynamically-allocated buffer whose address is a variable.


Static addresses are usually link-time constants, but often not assemble-time constants. (The exception is code that is definitely loaded at a known address and assembled as a flat binary, such as a legacy BIOS boot sector with org 0x7C00 or a DOS .COM program with org 0x100; there the assembler knows every label's final address.) This means you can do mov al, [COUNT+2], but not mov al, [COUNT*2]. (Object-file formats support adding an integer displacement to a symbol's address, but not other math operators.)
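A minimal sketch of what the assembler can and cannot compute, assuming NASM on x86-64 Linux (nasm -f elf64, then ld). DATA_LEN is a made-up name. The difference between two addresses in the same section is an assemble-time constant even though neither address is known yet, and label-plus-integer is left to the linker as a relocation.

```nasm
section .data
COUNT:  db 0x12
FOO:    db 0
BAR:    dw 0x80FF                   ; bytes FF 80
DATA_LEN equ $ - COUNT              ; distance within one section: assemble-time constant, 4

section .text
global _start
_start:
    ; label + integer is fine: COUNT + 3 is the last byte, 0x80
    movzx edi, byte [rel COUNT + DATA_LEN - 1]
    ; movzx edi, byte [COUNT*2]     ; rejected: a scaled address can't be expressed as a relocation
    mov   eax, 60                   ; __NR_exit
    syscall                         ; exit status 128 (0x80)
```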

In PIC code, label addresses are not even link-time constants, but at least in 64-bit PIC code the offset from code to a data label is a link-time constant, so RIP-relative addressing can be used without an extra level of indirection (through the Global Offset Table).
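A minimal sketch of RIP-relative addressing, assuming NASM on x86-64 Linux (nasm -f elf64, then ld). NASM's default rel directive makes a bare [label] assemble as a RIP-relative operand, so the machine code contains only the distance from the instruction to the data, never an absolute address.

```nasm
default rel                         ; bare [label] now means [rip + offset]

section .data
count:  dd 100

section .text
global _start
_start:
    mov   edi, [count]              ; load via RIP-relative addressing: 100
    lea   rsi, [count]              ; run-time address of count, computed from RIP
    add   edi, [rsi]                ; same location through the pointer: 100 + 100
    mov   eax, 60                   ; __NR_exit
    syscall                         ; exit status 200
```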

Answered by Peter Cordes on Oct 01 '22 at 19:10