What is the difference between a value and a memory address in x86_64 AT&T assembly language?

Novice here, with a frustratingly simple question. I'm trying to learn assembly and this has been a stumbling block for me for so long, I would really appreciate help with these concepts. Thanks so much.

Take the following statement:

movq $5, %rax

This is moving the value 5 itself into the register %rax, yes? That is to say, if I subsequently use %rax in an addition statement, it's going to treat that as the number 5 itself -- it's not going to try to add some memory address -- it's going to add the actual value 5. If I wanted to treat the number as a memory address, I'd have to leave the dollar sign off, yes? Then it would be treated as a memory address, not a numerical value, right?

And yet, if I define a label:

.section .data

my_number:
    .quad 5

and use the label to write the same statement:

movq my_number, %rax

suddenly everything is inverted. I now have to omit the dollar sign to get the same result. Why?

This statement is going to move the value 5 itself into %rax again, just like the previous statement, right? If I were to use the dollar sign before my_number, then I'd get the memory address. Which is the opposite of how it worked before. Before, using the literal, the dollar sign gave me the integer value (5), and leaving the dollar sign off gave me the memory address. Now, in the example with my_number, leaving the dollar sign off gives me the value, and using it gives me the memory address. Why the change? What happened?

It seems to me that the function of the dollar sign completely reverses itself from one example (movq $5, %rax) to another (movq my_number, %rax). These two instructions have the same functionality, they do the same thing, so why does one require the dollar sign and the other doesn't? Obviously my understanding of the concepts of values versus memory addresses has some major flaw, and I just haven't been able to identify it despite literally many, many hours of reading through forums, programming books, instructional videos, etc. -- several times in the past I gave up when I reached this point because I couldn't find an answer. Every time I try to revisit assembly language I reach this same obstacle.

Please help. Thank you in advance.

Andrew Boone asked Dec 02 '25 01:12



2 Answers

The first thing to understand is that the syntax is what it is. You can try to find explanations for why it is the way it is, but it's kind of hard to force it into a logical system that doesn't necessarily exist. Nevertheless, I have written before about why the syntax is the way it is and how exactly addressing modes work.

That said, to resolve your question, here is a way to think about it: First, the value of a symbol is its address, not whatever is stored at that address. The assembler doesn't distinguish values from addresses. For all it knows, if you write

        movq $5, %rax

you are loading the address 5 into register rax. Or the value 5. Who knows? Not the assembler. If you write

foo:    .quad 5

the value 5 will be placed somewhere in memory and the symbol foo will be assigned its address. Writing foo does not have the same effect as writing 5, because foo is where 5 is stored, not 5 itself. Of course, you can also make foo resolve to the address 5 by writing

        .equ foo, 5

or equivalently

foo=    5

This sets the value of the symbol foo (which the assembler treats as its address) to 5 and does not allocate any memory.
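As a minimal sketch of that point, assuming x86-64 Linux with GNU as and ld (the file name equ_demo.s and the use of the exit system call to report a result are choices made for this illustration, not something from the answer), the following program uses the value of foo as an immediate and returns it as the exit status:

```asm
# equ_demo.s -- illustration only; assumes x86-64 Linux, GNU as and ld
# Build: as -o equ_demo.o equ_demo.s && ld -o equ_demo equ_demo.o
# Run:   ./equ_demo; echo $?

        .equ    foo, 5              # foo is a symbol whose value is 5; no memory is allocated

        .globl  _start
        .text
_start:
        movq    $foo, %rdi          # immediate operand: rdi = value of foo = 5
        # movq  foo, %rdi           # memory operand: would read 8 bytes at address 5 and fault
        movq    $60, %rax           # 60 is the exit system call number on x86-64 Linux
        syscall                     # exit(rdi)
```

Running ./equ_demo followed by echo $? should print 5. Uncommenting the second movq instead is expected to crash, because address 5 is not mapped.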

Now why is the $ “decoration” needed in some cases but not in others?

Operands to an instruction always have an addressing mode. That is, they specify how the operand is obtained. An operand that starts with $ is an immediate operand, i.e. its value is encoded into the instruction. An operand that is just a plain expression (something like 5, foo, or 5+foo) is an absolutely addressed memory operand, i.e. the operand is an absolute address at which the value is found.
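The two addressing modes just described can be put side by side in a short sketch. It assumes x86-64 Linux with GNU as and ld, and a non-PIE link so that absolute addresses are allowed; the file name forms.s, the choice of registers, and the exit-status trick are assumptions made for this illustration:

```asm
# forms.s -- illustration only; assumes x86-64 Linux, GNU as and ld
# Build: as -o forms.o forms.s && ld -no-pie -o forms forms.o
# Run:   ./forms; echo $?

        .globl  _start

        .data
my_number:
        .quad   5                   # 8 bytes holding 5; my_number is their address

        .text
_start:
        movq    $5, %rax            # $ + number: immediate, rax = 5
        movq    my_number, %rbx     # bare symbol: memory operand, rbx = the 5 stored there
        movq    $my_number, %rcx    # $ + symbol: immediate, rcx = address of my_number
        movq    (%rcx), %rdx        # memory operand through a register, rdx = 5
        # movq  5, %rsi             # bare number: memory operand, would read address 5 and fault

        addq    %rbx, %rax          # rax = 5 + 5 = 10
        addq    %rdx, %rax          # rax = 10 + 5 = 15
        movq    %rax, %rdi          # exit status = 15
        movq    $60, %rax           # exit system call number on x86-64 Linux
        syscall
```

The exit status should be 15. The rule is the same in every line: $ means "use the value of this expression", no $ means "use what is stored at the address this expression names". A literal 5 and the symbol my_number are both just expressions; they only differ in what number the expression evaluates to.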

Directives (things like .quad, but also assignment through =) however do not have addressing modes. They just take expressions and then do something with the expression, like place its value into memory. Therefore, their operands look like absolutely addressed memory operands, but aren't. They are just expressions with no syntactically implied addressing mode.

So that's why a naked expression sometimes indicates a memory reference and sometimes seems to indicate an immediate value. Context matters.
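The contrast between directive operands and instruction operands can be sketched as follows, again assuming x86-64 Linux with GNU as and ld and a non-PIE link; the names table and count and the file name directives.s are hypothetical:

```asm
# directives.s -- illustration only; assumes x86-64 Linux, GNU as and ld
# Build: as -o directives.o directives.s && ld -no-pie -o directives directives.o
# Run:   ./directives; echo $?

        .equ    count, 3            # directive operand: plain expression, count = 3

        .globl  _start

        .data
table:
        .quad   5                   # stores the value 5
        .quad   count + 2           # stores 3 + 2 = 5, computed by the assembler
        .quad   table               # stores the address of table, not the 5 found there

        .text
_start:
        movq    table+8, %rdi       # instruction operand without $: load second quad, rdi = 5
        movq    table+16, %rbx      # load third quad, rbx = address of table
        addq    (%rbx), %rdi        # add the first quad through that pointer, rdi = 10
        movq    $60, %rax           # exit system call number on x86-64 Linux
        syscall
```

The exit status should be 10. Note that .quad table and movq table, ... use the same bare expression, yet the directive stores the expression's value (an address) while the instruction dereferences it.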

There are other syntaxes like Plan 9 syntax where directives do take addressing modes. For example, in Plan 9 syntax you'd write

DATA my_number(SB)/8, $5

with both an addressing mode for my_number and for the immediate 5 to write what AT&T syntax does with

my_number: .quad 5

However, that's Plan 9 syntax, not AT&T syntax. It's different.

fuz answered Dec 04 '25 18:12



The code that gets generated for (A):

mov $5, %rax

is fundamentally different from the code generated for (B):

mov my_number, %rax

Both will have the result of putting the number 5 into rax, but A will generate an immediate load of the number 5 into rax, while B will load the number from memory -- specifically, from the .data section of your running executable.

To see this, we can look at the generated code for each instruction. Here is your example:

# loads.s

.global test

.text

test:
    movq $5, %rax
    movq my_number, %rax
    ret

.data     # switch to the .data section.
          # Without this, my_number would be contiguous with the machine code
my_number:
    .quad 5

I assembled it with

as -o loads.o loads.s

and linked it with

ld -o loads -no-pie loads.o

and now we can view the machine code in the .text section with

objdump -dw loads

which prints:

Disassembly of section .text:

0000000000401000 <test>:
  401000:   48 c7 c0 05 00 00 00    mov    $0x5,%rax
  401007:   48 8b 04 25 30 30 40 00     mov    0x403030,%rax
  40100f:   c3                      ret

The first instruction has three leading bytes (0x48, 0xc7, 0xc0: a REX prefix with W=1, the opcode, and ModRM) that encode the instruction and operand style, followed by a 4-byte immediate value: 0x05 0x00 0x00 0x00. There's our 5! (It is stored little-endian, so the 5 byte comes first.) The CPU takes that 5 from the instruction stream, sign-extends it to 64 bits, and puts it into RAX.

The second instruction has four leading bytes (0x48, 0x8b, 0x04, 0x25: REX.W, opcode, ModRM, and SIB), followed by four more bytes: 0x30 0x30 0x40 0x00. These are not an immediate but a 32-bit absolute address, 0x403030, stored little-endian. This is the runtime virtual address in .data where, if all went right, a 5 will be located. The leading bytes tell the CPU to load from that address in memory and put the result into RAX. And now our little function has accomplished nothing, and we return.
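To see that the immediate really lives in the instruction stream, the first instruction can be written out by hand with .byte, using exactly the bytes from the disassembly. This is only a sketch, assuming x86-64 Linux with GNU as and ld; the file name bytes.s and the exit-status check are additions for this illustration:

```asm
# bytes.s -- illustration only; assumes x86-64 Linux, GNU as and ld
# Build: as -o bytes.o bytes.s && ld -o bytes bytes.o
# Run:   ./bytes; echo $?

        .globl  _start
        .text
_start:
        # Hand-encoded "movq $5, %rax", byte for byte as in the objdump output:
        .byte   0x48                    # REX prefix with W=1: 64-bit operand size
        .byte   0xc7                    # opcode: move imm32 to r/m
        .byte   0xc0                    # ModRM 11 000 000: register-direct, r/m = rax
        .byte   0x05, 0x00, 0x00, 0x00  # imm32 = 5, little-endian

        movq    %rax, %rdi              # exit status = whatever the bytes above loaded
        movq    $60, %rax               # exit system call number on x86-64 Linux
        syscall
```

The exit status should be 5. The second instruction cannot be hand-encoded as portably, because its last four bytes are an address chosen by the linker. Its ModRM byte 0x04 (00 000 100) says "a SIB byte follows", and the SIB byte 0x25 (00 100 101) says "no index, no base, just a 32-bit displacement".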


(The static executable we actually built from this source alone has nothing to return to; running it will segfault after ret pops argc (a small integer) from the stack into RIP and the CPU then tries to fetch code from that unmapped page. This source, which defines a function, is written to be linked into a larger program that contains a caller. We ran ld on it alone (without -pie, which GCC normally passes these days) only to fill an actual absolute address into the machine code instead of a placeholder, so we'd have a concrete example. Non-PIE was also necessary to allow this 32-bit absolute addressing mode, rather than my_number(%rip), to link.)
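For comparison, here is a sketch of the RIP-relative form mentioned in that note, written as a standalone program with its own _start so that it exits cleanly instead of returning. It assumes x86-64 Linux with GNU as and ld; the file name rip.s and the exit-status check are choices made for this illustration:

```asm
# rip.s -- illustration only; assumes x86-64 Linux, GNU as and ld
# Build: as -o rip.o rip.s && ld -o rip rip.o
# Run:   ./rip; echo $?

        .globl  _start

        .data
my_number:
        .quad   5

        .text
_start:
        movq    my_number(%rip), %rdi   # memory operand: rdi = the 5 stored at my_number
        leaq    my_number(%rip), %rbx   # address only: rbx = address of my_number, no load
        addq    (%rbx), %rdi            # load through the pointer, rdi = 5 + 5 = 10
        movq    $60, %rax               # exit system call number on x86-64 Linux
        syscall
```

The exit status should be 10. With my_number(%rip), the instruction encodes a distance from the instruction to the data rather than an absolute address, which is why this form does not depend on a non-PIE link. leaq is the usual position-independent counterpart of movq $my_number, %reg.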

pjc answered Dec 04 '25 20:12



