Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

why we can't move a 64-bit immediate value to memory?

First I am a little bit confused with the differences between movq and movabsq, my text book says:

The regular movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers. This value is then sign extended to produce the 64-bit value for the destination. The movabsq instruction can have an arbitrary 64-bit immediate value as its source operand and can only have a register as a destination.

I have two questions to this.

Question 1

The movq instruction can only have immediate source operands that can be represented as 32-bit two’s-complement numbers.

so it means that we can't do

movq    $0x123456789abcdef, %rbp

and we have to do:

movabsq $0x123456789abcdef, %rbp

but why movq is designed to not work for 64 bits immediate value, which is really against the purpose of q (quard word), and we need to have another movabsq just for this purpose, isn't that hassle?

Question 2

Since the destination of movabsq has to be a register, not memory, so we can't move a 64-bit immediate value to memory as:

movabsq $0x123456789abcdef, (%rax)

but there is a workaround:

movabsq $0x123456789abcdef, %rbx
movq    %rbx, (%rax)   // the source operand is a register, not immediate constant, and the destination of movq can be memory

so why the rule is designed to make things harder?

like image 323
amjad Avatar asked Dec 13 '22 08:12

amjad


1 Answers

Yes, mov to a register then to memory for immediates that won't fit in a sign-extended 32-bit, unlike -1 aka 0xFFFFFFFFFFFFFFFF. The why part is interesting question, though:


Remember that asm only lets you do what's possible in machine code. Thus it's really a question about ISA design. Such decisions often involve what's easy for the hardware to decode, as well as encoding efficiency considerations. (Using up opcodes on rarely-used instructions would be bad.)

It's not designed to make things harder, it's designed to not need any new opcodes for mov. And also to limit 64-bit immediates to one special instruction format. mov is the only instruction that can ever use a 64-bit immediate at all (or a 64-bit absolute address, for load/store of AL/AX/EAX/RAX).

Check out Intel's manual for the forms of mov (note that it uses Intel syntax, destination first, and so will my answer.) I also summarized the forms (and their instruction lengths) in Difference between movq and movabsq in x86-64, as did @MargaretBloom in answer to What's the difference between the x86-64 AT&T instructions movq and movabsq?.

Allowing an imm64 along with a ModR/M addressing mode would also make it possible to run into the 15-byte upper limit on instruction length pretty easily, e.g. REX + opcode + imm64 is 10 bytes, and ModRM+SIB+disp32 is 6. So mov [rdi + rax*8 + 1234], imm64 would not be encodeable even if there was an opcode for mov r/m64, imm64.

And that's assuming they repurposed one of the 1-byte opcodes that were freed up by making some instructions invalid in 64-bit mode (e.g. aaa), which might be inconvenient for the decoders (and instruction-length pre-decoders) because in other modes those opcodes don't take a ModRM byte or an immediate.


movq is for the forms of mov with a normal ModRM byte to allow an arbitrary addressing mode as the destination. (Or as the source for movq r64, r/m64). AMD chose to keep the immediate for these as 32-bit, same as with 32-bit operand size1.

These forms of mov are the same instruction format as other instructions like add. For ease of decoding, this means a REX prefix doesn't change the instruction-length for these opcodes. Instruction-length decoding is already hard enough when the addressing mode is variable-length.

So movq is 64-bit operand-size but otherwise the same instruction format mov r/m64, imm32 (becoming the sign-extended-immediate form, same as every other instruction which only has one immediate form), and mov r/m64, r64 or mov r64, r/m64.

movabs is the 64-bit form of the existing no-ModRM short form mov reg, imm32. This one is already a special case (because of the no-modrm encoding, with register number from the low 3 bits of the opcode byte). Small positive constants can just use 32-bit operand-size for implicit zero-extension to 64-bit with no loss of efficiency (like 5-byte mov eax, 123 / AT&T mov $123, %eax in 32 or 64-bit mode). And having a 64-bit absolute mov is useful so it makes sense AMD did that.

Since there's no ModRM byte, it can only encode a register destination. It would take a whole different opcode to add a form that could take a memory operand.


From one POV, be grateful you get a mov with 64-bit immediates at all; RISC ISAs like AArch64 (with fixed-width 32-bit instructions) need more like 4 instructions just to get a 64-bit value into a register. (Unless it's a repeating bit-pattern; AArch64 is actually pretty cool. Unlike earlier RISCs like MIPS64 or PowerPC64)

If AMD64 was going to introduce a new opcode for mov, mov r/m, sign_extended_imm8 would be vastly more useful to save code-size. It's not at all rare for compilers to emit multiple mov qword ptr [rsp+8], 0 instructions to zero a local array or struct, each one containing a 4-byte 0 immediate. Putting a non-zero small number in a register is fairly common, and would make mov eax, 123 a 3-byte instruction (down from 5), and mov rax, -123 a 4-byte instruction (down from 7). It would also make zeroing a register without clobbering FLAGS 3 bytes.

Allowing mov imm64 to memory would be useful rarely enough that AMD decided it wasn't worth making the decoders more complex. In this case I agree with them, but AMD was very conservative with adding new opcodes. So many missed opportunities to clean up x86 warts, like widening setcc would have been nice. But I think AMD wasn't sure AMD64 would catch on, and didn't want to be stuck needing a lot of extra transistors / power to support a feature if people didn't use it.

Footnote 1:
32-bit immediates in general is pretty obviously a good decision for code-size. It's very rare to want to add an immediate to something that's outside the +-2GiB range. It could be useful for bitwise stuff like AND, but for setting/clearing/flipping a single bit the bts / btr / btc instructions are good (taking a bit-position as an 8-bit immediate, instead of needing a mask). You don't want sub rsp, 1024 to be an 11-byte instruction; 7 is already bad enough.


Giant instructions? Not very efficient

At the time AMD64 was designed (early 2000s), CPUs with uop caches weren't a thing. (Intel P4 with a trace cache did exist, but in hindsight it was regarded as a mistake.) Instruction fetch/decode happens in chunks of up-to-16 bytes, so having one instruction that's nearly 16 bytes isn't much better for the front-end than movabs $imm64, %reg.

Of course if the back-end isn't keeping up with the front-end, that bubble of only 1 instruction decoded this cycle can be hidden by buffering between stages.

Keeping track of that much data for one instruction would also be a problem. The CPU has to put that data somewhere, and if there's a 64-bit immediate and a 32-bit displacement in the addressing mode, that's a lot of bits. Normally an instruction needs at most 64-bits of space for an imm32 + a disp32.


BTW, there are special no-modrm opcodes for most operations with RAX and an immediate. (x86-64 evolved out of 8086, where AX/AL was more special, see this for more history and explanation). It would have been a plausible design for those add/sub/cmp/and/or/xor/... rax, sign_extended_imm32 forms with no ModRM to instead use a full imm64. The most common case for RAX, immediate uses an 8-bit sign-extended immediate (-128..127), not this form anyway, and it only saves 1 byte for instructions that need a 4-byte immediate. If you do need an 8-byte constant, though, putting it in a register or memory for reuse would be better than doing a 10-byte and-imm64 in a loop, though.

like image 106
Peter Cordes Avatar answered Dec 31 '22 13:12

Peter Cordes