After reading this Stack Overflow answer, and this document, I still don't understand the difference between movq and movabsq.
My current understanding is that in movabsq, the first operand is a 64-bit immediate operand, whereas movq sign-extends a 32-bit immediate operand. From the 2nd document referenced above:
Moving immediate data to a 64-bit register can be done either with the movq instruction, which will sign extend a 32-bit immediate value, or with the movabsq instruction, when a full 64-bit immediate is required.
In the first reference, Peter states:
Interesting experiment: movq $0xFFFFFFFF, %rax is probably not encodeable, because it's not representable with a sign-extended 32-bit immediate, and needs either the imm64 encoding or the %eax destination encoding. (Editor's note: this mistaken assumption is fixed in the current version of that answer.)
However, when I assemble/run this it seems to work fine:
.section .rodata
str:
    .string "0x%lx\n"
.text
.globl main
main:
    pushq   %rbp
    movq    %rsp, %rbp
    movl    $str, %edi
    movq    $0xFFFFFFFF, %rsi
    xorl    %eax, %eax
    call    printf
    xorl    %eax, %eax
    popq    %rbp
    ret
$ clang file.s -o file && ./file
prints 0xffffffff. (This works similarly for larger values, for instance if you throw in a few additional "F"s.) movabsq generates identical output.
Is Clang inferring what I want? If it is, is there still a benefit to movabsq over movq?
Did I miss something?
There are three kinds of moves that fill a 64-bit register:

(1) Moving to the low 32-bit part: B8 +rd id, 5 bytes
Example: mov eax, 241 / mov[l] $241, %eax
Moving to the low 32-bit part zeroes the upper part.

(2) Moving with a 64-bit immediate: 48 B8 +rd io, 10 bytes
Example: mov rax, 0xf1f1f1f1f1f1f1f1 / mov[abs][q] $0xf1f1f1f1f1f1f1f1, %rax
This moves a full 64-bit immediate.

(3) Moving with a sign-extended 32-bit immediate: 48 C7 /0 id, 7 bytes
Example: mov rax, 0xffffffffffffffff / mov[q] $0xffffffffffffffff, %rax
This moves a signed 32-bit immediate, sign-extended to fill the 64-bit register.
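As a quick reference, here is what the three forms look like in AT&T syntax together with the bytes they assemble to (derived from the opcode forms above, using rax/eax so that +rd is 0):

    movl    $241, %eax                    # (1) B8 F1 00 00 00                      5 bytes, upper half of %rax zeroed
    movabsq $0xf1f1f1f1f1f1f1f1, %rax     # (2) 48 B8 F1 F1 F1 F1 F1 F1 F1 F1      10 bytes, full 64-bit immediate
    movq    $0xffffffffffffffff, %rax     # (3) 48 C7 C0 FF FF FF FF                7 bytes, imm32 sign-extended to 64 bits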
Notice how at the assembly level there is room for ambiguity: movq is used for both the second and the third case.
For each immediate value x we have four cases:
(a) 0 ≤ x < 2^31: fits both as a plain 32-bit immediate and as a sign-extended one, so (1), (2) and (3) are all possible.
(b) 2^31 ≤ x < 2^32: fits as a plain 32-bit immediate but not as a sign-extended one, so (1) and (2) are possible.
(c) values that fit in neither 32-bit form: only (2) is possible.
(d) negative values whose 64-bit pattern is the sign extension of a 32-bit immediate (for example -1): (2) and (3) are possible.
All the cases but the third have at least two possible encodings.
When more than one encoding is available, the assembler usually picks the shortest one, but not always.
For GAS:
movabs[q] always corresponds to (2).
mov[q] corresponds to (3) for cases (a) and (d), and to (2) for the other cases.
It never generates (1) for a move to a 64-bit register. To make it pick (1) we have to write the equivalent mov[l] $0xffffffff, %edi instead (I believe GAS won't convert a move to a 64-bit register into one to its lower 32-bit half, even when the two are equivalent).
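To make those rules concrete, here is roughly what GAS should emit for a few immediates; the byte sequences follow from the three encodings listed earlier and can be checked by assembling a file and disassembling it with objdump -d:

    movq    $0x7fffffff, %rax       # case (a): emitted as (3), 48 C7 C0 FF FF FF 7F
    movq    $0xffffffff, %rax       # case (b): (3) can't represent it, so GAS falls back to (2), 48 B8 FF FF FF FF 00 00 00 00
    movq    $0x112233445566, %rax   # case (c): only (2) is possible, 48 B8 66 55 44 33 22 11 00 00
    movq    $-1, %rax               # case (d): emitted as (3), 48 C7 C0 FF FF FF FF
    movl    $0xffffffff, %edi       # the only way to get (1): use the 32-bit register name, BF FF FF FF FF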
In the 16/32-bit era, distinguishing between (1) and (3) was not considered particularly important (though in GAS it is possible to pick one specific form), since no sign extension was involved; having two encodings was just an artefact of the original 8086 encoding.
The mov mnemonic was never split into two forms to account for (1) and (3); a single mov was used, with the assembler almost always picking (1) over (3).
With the new 64-bit registers, giving every mov a 64-bit immediate would have made the code far too sparse (and would easily run into the maximum instruction length of 15 bytes), so it was not worth extending (1) to always take a 64-bit immediate.
Instead, (1) still takes a 32-bit immediate and zero-extends it (which also breaks any false data dependency), and (2) was introduced for the rare cases where a full 64-bit immediate is actually needed.
While they were at it, (3) was also changed: it still takes a 32-bit immediate, but now sign-extends it.
(1) and (3) should suffice for the most common immediates (like 1 or -1).
However, the difference between (1)/(3) and (2) is deeper than the old difference between (1) and (3): while (1) and (3) both have an operand of the same size, 32 bits, (2) has a 64-bit immediate operand.
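The practical consequence of the zero extension in (1) versus the sign extension in (3) can be seen from the resulting register values (shown in the comments, following the behavior described above):

    movl    $-1, %eax       # (1): %rax becomes 0x00000000FFFFFFFF, upper half zeroed
    movq    $-1, %rax       # (3): %rax becomes 0xFFFFFFFFFFFFFFFF, imm32 sign-extended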
Why would one want an artificially lengthened instruction?
As described in the linked answer, one use case could be padding so that the top of the next loop is at a multiple of 16/32 bytes, without needing any NOP instructions.
This sacrifices code density (the code takes more space in the instruction cache) and decode efficiency outside the loop in exchange for better front-end efficiency on each loop iteration. But longer instructions are still generally cheaper for the front-end than having to decode some NOPs as well.
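A rough sketch of the padding idea, assuming the loop top would otherwise fall 5 bytes short of a 16-byte boundary (the constant and the .Lloop label are purely illustrative):

    movabsq $100, %rcx      # encoding (2), 10 bytes, instead of movl $100, %ecx (encoding (1), 5 bytes):
                            # the extra 5 bytes push .Lloop to the boundary without any NOPs
.Lloop:
    # ...loop body...
    decq    %rcx
    jnz     .Lloop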
Another, and more frequent, use case is when one only needs to generate a machine-code template.
For example, in a JIT one may want to prepare the sequence of instructions ahead of time and fill in the immediate values only at run time.
In that case using (2) greatly simplifies the handling, since there is always enough room for every possible value.
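For instance, a hypothetical template could be emitted once with a dummy constant, and the 8-byte immediate field overwritten at run time (the label and constant here are made up for illustration):

load_const_template:
    movabsq $0x0123456789abcdef, %rax   # 48 B8 + 8 immediate bytes; the JIT overwrites
                                        # those 8 bytes in place with the real value
    ret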
Another case is patching: in a debug build of a piece of software, specific calls could be made indirectly through a register that has just been loaded with (2), so that a debugger can easily hijack the call and redirect it to any new target.
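A sketch of that pattern (some_function is a placeholder symbol, not anything from the original post):

    movabsq $some_function, %rax    # always 10 bytes, so a debugger can rewrite the
                                    # 8-byte address in place to redirect the call
    callq   *%rax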