Why NASM on Linux changes registers in x86_64 assembly

Question

I am new to x86_64 assembly programming. I was writing a simple "Hello World" program in x86_64 assembly. Below is my code, which runs perfectly fine.

global _start

section .data

    msg: db "Hello to the world of SLAE64", 0x0a
    mlen equ $-msg

section .text
    _start:
            mov rax, 1
            mov rdi, 1
            mov rsi, msg
            mov rdx, mlen
            syscall

            mov rax, 60
            mov rdi, 4
            syscall

Now, when I disassemble it in gdb, it gives the output below:

(gdb) disas
Dump of assembler code for function _start:
=> 0x00000000004000b0 <+0>:     mov    eax,0x1
   0x00000000004000b5 <+5>:     mov    edi,0x1
   0x00000000004000ba <+10>:    movabs rsi,0x6000d8
   0x00000000004000c4 <+20>:    mov    edx,0x1d
   0x00000000004000c9 <+25>:    syscall
   0x00000000004000cb <+27>:    mov    eax,0x3c
   0x00000000004000d0 <+32>:    mov    edi,0x4
   0x00000000004000d5 <+37>:    syscall
End of assembler dump.

My question is: why does NASM behave this way? I know it changes instructions based on the opcode, but I am not sure about the same behaviour with registers.

Also, does this behaviour affect the functionality of the executable?

I am using Ubuntu 16.04 (64-bit) installed in VMware on an i5 processor.

Thank you in advance.

Margaret Bloom · Accepted Answer

In 64-bit mode, mov eax, 1 clears the upper 32 bits of the rax register (see here for an explanation), so mov eax, 1 is semantically equivalent to mov rax, 1.

The former, however, spares a REX.W prefix (48h numerically; REX prefixes select 64-bit operand size and give access to the registers introduced with x86-64). The opcode is the same for both instructions (0b8h, followed by a DWORD or a QWORD).
So the assembler goes ahead and picks the shortest form.
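The equivalence can be checked with a small model. This is only a Python sketch of the register semantics described above, not anything NASM runs; the function names are made up for illustration.

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def mov_r32_imm32(old_rax: int, imm32: int) -> int:
    """Model `mov eax, imm32`: a 32-bit register write zeroes bits 63..32 of RAX."""
    return imm32 & MASK32  # old_rax is deliberately ignored

def mov_r64_imm64(old_rax: int, imm64: int) -> int:
    """Model `mov rax, imm64`: the whole 64-bit register is replaced."""
    return imm64 & MASK64  # old_rax is deliberately ignored

# Whatever RAX held before, both forms leave RAX == 1.
for old in (0, 0xDEADBEEFCAFEBABE, MASK64):
    assert mov_r32_imm32(old, 1) == mov_r64_imm64(old, 1) == 1
```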

This is typical behavior for NASM; see Section 3.3 of the NASM manual, where the example [eax*2] is assembled as [eax+eax] to spare the disp32 field after the SIB byte¹ ([eax*2] is only encodable as [eax*2+disp32], with the assembler setting disp32 to 0).
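To make the size difference concrete, here is a rough Python sketch that builds only the ModRM, SIB and displacement bytes for the two addressing forms (prefixes and the opcode are left out). The helper names are hypothetical, and the reg field is assumed to be eax, as in mov eax, [...].

```python
import struct

def modrm(mod: int, reg: int, rm: int) -> int:
    return (mod << 6) | (reg << 3) | rm

def sib(scale_log2: int, index: int, base: int) -> int:
    return (scale_log2 << 6) | (index << 3) | base

EAX = 0
USE_SIB = 0b100  # rm value meaning "a SIB byte follows"
NO_BASE = 0b101  # SIB base value meaning "no base, disp32 follows" when mod == 00

def addr_eax_plus_eax() -> bytes:
    # [eax+eax]: base = eax, index = eax, scale = 1, no displacement
    return bytes([modrm(0b00, EAX, USE_SIB), sib(0, EAX, EAX)])

def addr_eax_times_2() -> bytes:
    # [eax*2]: index = eax, scale = 2, no base, so a zero disp32 is mandatory
    return bytes([modrm(0b00, EAX, USE_SIB), sib(1, EAX, NO_BASE)]) + struct.pack("<i", 0)

assert len(addr_eax_plus_eax()) == 2  # 04 00
assert len(addr_eax_times_2()) == 6   # 04 45 00 00 00 00
```

Both forms compute the same address, which is why the assembler is free to pick the one that is four bytes shorter.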

I was unable to force NASM to emit a real mov rax, 1 instruction (i.e. 48 B8 01 00 00 00 00 00 00 00) even by prefixing the instruction with o64.
If a real mov rax, 1 is needed (this is not your case), one must resort to assembling it manually with db and similar.

EDIT: Peter Cordes' answer shows that there is, in fact, a way to tell NASM not to optimize an instruction: the strict modifier.
mov rax, STRICT 1 produces the 10-byte version of the instruction (mov r64, imm64) while mov rax, STRICT DWORD 1 produces a 7-byte version (mov r64, imm32 where imm32 is sign-extended before use).

Side note: It's better to use RIP-relative addressing; this avoids 64-bit immediate constants (thus reducing code size) and is mandatory on macOS (in case you care).
Change mov rsi, msg to lea rsi, [REL msg] (RIP-relative is an addressing mode, so it needs the square brackets of a memory operand; to avoid reading from that address we use lea, which only computes the effective address without accessing memory).
You can use the directive DEFAULT REL to avoid typing REL in each memory access.
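As a rough illustration of what the assembler and linker compute for a RIP-relative lea, the Python sketch below encodes lea rsi, [rel target]. The displacement is measured from the end of the instruction. The addresses are borrowed from the gdb dump in the question purely as sample values; a reassembled binary may well be laid out differently.

```python
import struct

def lea_rsi_rel(instr_addr: int, target_addr: int) -> bytes:
    """Encode `lea rsi, [rel target]` placed at instr_addr (hypothetical helper)."""
    # REX.W, LEA opcode, ModRM with mod=00 reg=rsi(110) rm=101 (RIP + disp32)
    head = bytes([0x48, 0x8D, 0x35])
    length = len(head) + 4                       # 7 bytes in total
    disp = target_addr - (instr_addr + length)   # relative to the next instruction
    return head + struct.pack("<i", disp)

code = lea_rsi_rel(0x4000BA, 0x6000D8)  # sample addresses only
print(code.hex(" "))  # 48 8d 35 17 00 20 00
```

That is 7 bytes instead of the 10-byte movabs, and it contains no absolute address.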

I was under the impression that the Mach-O file format required PIC code but this may not be the case.

¹ The Scale Index Base byte, used to encode the new addressing mode introduced back then with the 32-bit mode.

Peter Cordes · Answer

TL;DR: You can override this with

mov eax, 1 (explicitly use the optimal operand-size)
b8 01 00 00 00
mov rax, strict dword 1 (sign-extended 32-bit immediate)
48 c7 c0 01 00 00 00
mov rax, strict qword 1 (64-bit immediate like movabs in AT&T syntax)
48 b8 01 00 00 00 00 00 00 00
(Also mov rax, strict 1 is equivalent to this, and is what you get if you disable NASM optimization.)

This is a perfectly safe and useful optimization, similar to using an 8-bit immediate instead of a 32-bit immediate when you write add eax, 1.

NASM only optimizes when the shorter form of the instruction has an identical architectural effect. That holds here because mov eax,1 implicitly zeros the upper 32 bits of RAX. Note that add rax, 0 is different from add eax, 0, so NASM can't optimize that: only instructions like mov r32,... / mov r64,... or xor eax,eax, which don't depend on the old value of the 32- vs. 64-bit register, can be optimized this way.
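A small Python model (register value only, flags ignored; the function names are invented for this sketch) shows why the two add forms are not interchangeable:

```python
MASK64 = (1 << 64) - 1
MASK32 = (1 << 32) - 1

def add_rax_imm(rax: int, imm: int) -> int:
    """Model `add rax, imm`: full 64-bit addition."""
    return (rax + imm) & MASK64

def add_eax_imm(rax: int, imm: int) -> int:
    """Model `add eax, imm`: 32-bit addition, result zero-extended into RAX."""
    return (rax + imm) & MASK32

old = 0x0000000100000005       # a value with a bit set above bit 31
assert add_rax_imm(old, 0) == old  # upper half preserved
assert add_eax_imm(old, 0) == 5    # upper half cleared
```

The result depends on the old upper bits of RAX, so narrowing the operand size would change behaviour.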

You can disable it with nasm -O1 (the default is -Ox multipass), but note that you'll get 10-byte mov rax, strict qword 1 in that case: clearly NASM isn't intended to really be used with less than normal optimization. There isn't a setting where it will use the shortest encoding that wouldn't change the disassembly (e.g. 7-byte mov rax, sign_extended_imm32 = mov rax, strict dword 1).

The difference between -O0 and -O1 is in imm8 vs. imm32, e.g. add rax, 1 is
48 83 C0 01 (add r/m64, sign_extended_imm8) with -O1, vs.
48 05 01000000 (add rax, sign_extended_imm32) with nasm -O0.
Amusingly it still optimized by picking the special-case opcode that implies an RAX destination instead of taking a ModRM byte. Unfortunately -O1 doesn't optimize immediate sizes for mov (where sign_extended_imm8 isn't possible.)
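The imm8 vs. imm32 choice can be sketched in Python as follows. The optimize flag is a hypothetical stand-in for "pick the short immediate"; it is not a faithful model of NASM's -O levels, and only the two encodings quoted above are produced.

```python
import struct

def encode_add_rax(imm: int, optimize: bool = True) -> bytes:
    """Encode `add rax, imm` (illustrative helper, not NASM's algorithm)."""
    if optimize and -128 <= imm <= 127:
        # REX.W 83 /0 ib: add r/m64, sign-extended imm8 (ModRM C0 selects rax)
        return bytes([0x48, 0x83, 0xC0]) + struct.pack("<b", imm)
    if -2**31 <= imm < 2**31:
        # REX.W 05 id: add rax, sign-extended imm32 (RAX implied by the opcode)
        return bytes([0x48, 0x05]) + struct.pack("<i", imm)
    raise ValueError("immediate does not fit in a sign-extended imm32")

assert encode_add_rax(1).hex(" ") == "48 83 c0 01"
assert encode_add_rax(1, optimize=False).hex(" ") == "48 05 01 00 00 00"
```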

If you ever need a specific encoding somewhere, ask for it with strict instead of disabling optimization.

Note that YASM doesn't do this operand-size optimization, so it's a good idea to make the optimization yourself in the asm source, if you care about code-size (even indirectly for performance reasons) in code that could be assembled with other NASM-compatible assemblers.

For instructions where 32 and 64-bit operand size wouldn't be equivalent if you had very large (or negative) numbers, you need to use 32-bit operand-size explicitly even if you're assembling with NASM instead of YASM, if you want the size / performance advantage. See The advantages of using 32bit registers/instructions in x86-64.

For 32-bit constants that don't have their high bit set, zero or sign extending them to 64 bits gives an identical result. Thus it's a pure optimization to assemble mov rax, 1 to a 5-byte mov r32, imm32 (with implicit zero extension to 64 bits) instead of a 7-byte mov r/m64, sign_extended_imm32.
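The claim about the high bit is easy to check with a tiny Python sketch (helper names are illustrative only):

```python
def zero_extend32(x: int) -> int:
    """What `mov r32, imm32` leaves in the 64-bit register."""
    return x & 0xFFFFFFFF

def sign_extend32(x: int) -> int:
    """What `mov r/m64, imm32` leaves in the 64-bit register."""
    x &= 0xFFFFFFFF
    if x & 0x80000000:                 # high bit set: copy it into bits 63..32
        return x | 0xFFFFFFFF00000000
    return x

# High bit clear: both extensions agree, so the 5-byte form is safe.
assert zero_extend32(1) == sign_extend32(1) == 1
# High bit set: the results differ, so the forms are not interchangeable.
assert zero_extend32(0x80000000) == 0x0000000080000000
assert sign_extend32(0x80000000) == 0xFFFFFFFF80000000
```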

(See Difference between movq and movabsq in x86-64 for more details about the forms of mov x86-64 allows; AT&T syntax has a special name for the 10-byte immediate form but NASM doesn't.)

On all current x86 CPUs, the only performance difference between that and the 7-byte encoding is code-size, so only indirect effects like alignment and L1I$ pressure are a factor. Internally it's just a mov-immediate, so this optimization doesn't change the microarchitectural effect of your code either (except of course for code-size / alignment / how it packs in the uop cache).

The 10-byte mov r64, imm64 encoding is even worse for code size. If the constant actually has any of its high bits set, then it has extra inefficiency in the uop cache on Intel Sandybridge-family CPUs (using 2 entries in the uop cache, and maybe an extra cycle to read from the uop cache). But if the constant is in the -2^31 .. 2^31-1 range (signed 32-bit), it's stored internally just as efficiently, using only a single uop-cache entry, even if it was encoded in the x86 machine code using a 64-bit immediate. (See Agner Fog's microarch doc, Table 9.1. Size of different instructions in μop cache in the Sandybridge section)

From How many ways to set a register to zero?, you can force any of the three encodings:

mov    eax, 1                ; 5 bytes to encode (B8 imm32)
mov    rax, strict dword 1   ; 7 bytes: REX mov r/m64, sign-extended-imm32.    NASM optimizes mov rax,1 to the 5B version, but dword or strict dword stops it for some reason
mov    rax, strict qword 1   ; 10 bytes to encode (REX B8 imm64).  movabs mnemonic for AT&T.  Normally assemblers choose smaller encodings if the operand fits, but strict qword forces the imm64.
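For reference, the three byte sequences can be reproduced with a few lines of Python. This is only a sketch that packs the opcodes quoted in this answer with a little-endian immediate; the function names are made up, and it says nothing about how NASM chooses between the forms.

```python
import struct

def mov_eax_imm32(imm: int) -> bytes:
    # B8 id: mov r32, imm32 (zero-extends into RAX)
    return b"\xB8" + struct.pack("<I", imm & 0xFFFFFFFF)

def mov_rax_simm32(imm: int) -> bytes:
    # REX.W C7 /0 id: mov r/m64, sign-extended imm32
    return b"\x48\xC7\xC0" + struct.pack("<i", imm)

def mov_rax_imm64(imm: int) -> bytes:
    # REX.W B8 io: mov r64, imm64 (movabs in AT&T syntax)
    return b"\x48\xB8" + struct.pack("<Q", imm & 0xFFFFFFFFFFFFFFFF)

for encode in (mov_eax_imm32, mov_rax_simm32, mov_rax_imm64):
    code = encode(1)
    print(f"{len(code):2d} bytes: {code.hex(' ')}")
```

The lengths come out as 5, 7 and 10 bytes, matching the comments above.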

Note that NASM used the 10-byte encoding (which AT&T syntax calls movabs, and so does objdump in Intel-syntax mode) for an address which is a link-time constant but unknown at assemble time.

YASM chooses mov r64, imm32, i.e. it assumes a code-model where label addresses are 32 bits, unless you use mov rsi, strict qword msg

YASM's behaviour is normally good (although using mov r32, imm32 for static absolute addresses like C compilers do would be even better). The default non-PIC code-model puts all static code/data in the low 2GiB of virtual address space, so zero- or sign-extended 32-bit constants can hold addresses.

If you want 64-bit label addresses you should normally use lea r64, [rel address] to do a RIP-relative LEA. (On Linux at least, position-dependent code can go in the low 32 bits of virtual address space, so unless you're using the large / huge code models, any time you need to care about 64-bit label addresses, you're also making PIC code where you should use RIP-relative LEA to avoid needing text relocations of absolute address constants).

i.e. gcc and other compilers would have used mov esi, msg, or lea rsi, [rel msg], never mov rsi, msg.
See How to load address of function or label into register

Tags:

assembly

x86-64

micro-optimization

nasm

shellcode

Shashank Gosavi

2 Answers

Margaret Bloom

Peter Cordes
