
MOVZX missing 32 bit register to 64 bit register

Here's the instruction which copies (converts) unsigned registers: http://www.felixcloutier.com/x86/MOVZX.html

Basically the instruction has 8->16, 8->32, 8->64, 16->32 and 16->64.

Where's the 32->64 conversion? Do I have to use the signed version for that?
If so how do you use the full 64 bits for an unsigned integer?

asked Jul 17 '18 by Ryan Brown


1 Answer

Short answer

Use mov eax, edi to zero-extend EDI into RAX if you can't already guarantee that the high bits of RDI are all zero. See: Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?

Prefer using different source/destination registers, because mov-elimination fails for mov eax,eax on both Intel and AMD CPUs. Moving to a different register can have zero latency and need no execution unit. (gcc apparently doesn't know this and usually zero-extends in place.) Don't spend extra instructions to make that happen, though.
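For example, a minimal hand-written sketch (NASM; the register choices are just for illustration):

    mov     eax, edi             ; zero-extends EDI into RAX: writing EAX zeroes the upper 32 bits
    add     rax, rsi             ; RAX is now safe to use in 64-bit arithmetic or addressing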


Long answer

Machine-code reason why there's no encoding for movzx with a 32-bit source

summary: Every different source width for movzx and movsx needs a different opcode. The destination width is controlled by prefixes. Since mov can do the job, a new opcode for movzx dst, r/m32 would be redundant.

When designing AMD64 assembler syntax, AMD chose not to make movzx rax, edx work as a pseudo-instruction for mov eax, edx. This is probably a good thing, because knowing that writing a 32-bit register zeros the upper bytes is very important to writing efficient code for x86-64.


AMD64 did need a new opcode for sign extension with a 32-bit source operand. They named the mnemonic movsxd for some reason, instead of making it a 3rd opcode for the movsx mnemonic. Intel documents them all together in one ISA ref manual entry. They repurposed the 1-byte opcode that was ARPL in 32-bit mode, so movsxd is actually 1 byte shorter than movsx from 8 or 16-bit sources (assuming you still need a REX prefix to extend to 64-bit).

Different destination sizes use the same opcode with a different operand size (footnote 1). (A 66 or REX.W prefix selects 16-bit or 64-bit instead of the default 32-bit.) e.g. movsx eax, bl and movsx rax, bl differ only in the REX prefix; same opcode. (movsx ax, bl is also the same, but with a 66 prefix to make the operand-size 16-bit.)
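To make that concrete, here's a sketch of the encodings as I read the opcode maps (the ModRM byte C3 means a bl/bx source and an eax/ax/rax destination; C2 means an edx source and rax destination):

       0F BE C3        movsx  eax, bl      ; default 32-bit operand size
    66 0F BE C3        movsx  ax,  bl      ; 66 prefix selects 16-bit operand size
    48 0F BE C3        movsx  rax, bl      ; REX.W prefix selects 64-bit operand size
       0F BF C3        movsx  eax, bx      ; 16-bit source needs a different opcode (0F BF)
    48 63 C2           movsxd rax, edx     ; 32-bit source needs a different opcode again (63)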

Before AMD64, there was no need for an opcode that reads a 32-bit source, because the maximum destination width was 32 bits, and "sign-extension" to the same size is just a copy. Notice that movsxd eax, eax is legal but not recommended. You can even encode it with a 66 prefix to read a 32-bit source and write a 16-bit destination (footnote 2).

The use of MOVSXD without REX.W in 64-bit mode is discouraged. Regular MOV should be used instead of using MOVSXD without REX.W.

32->64 bit sign extension can be done with cdq to sign-extend EAX into EDX:EAX (e.g. before 32-bit idiv). This was the only way before x86-64 (other than, of course, copying and using an arithmetic right shift to broadcast the sign bit).
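For example, the classic pattern before a signed 32-bit division (a minimal sketch; the dividend/divisor registers are just for illustration):

    mov     eax, edi             ; hypothetical 32-bit dividend
    cdq                          ; sign-extend EAX into EDX:EAX
    idiv    ecx                  ; signed divide of EDX:EAX by ECX: quotient in EAX, remainder in EDX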


But AMD64 already zero-extends from 32 to 64 for free with any instruction that writes a 32-bit register. This avoids false dependencies for out-of-order execution, which is why AMD broke with the 8086 / 386 tradition of leaving upper bytes untouched when writing a partial register. (Why doesn't GCC use partial registers?)

Since each source width needs a different opcode, no prefixes can make either of the two movzx opcodes read a 32-bit source.


You do sometimes need to spend an instruction to zero-extend something. It's common in compiler output for small functions, because the x86-64 SysV and Windows x64 calling conventions allow high garbage in args and return values.

As usual, ask a compiler if you want to know how to do something in asm, especially when you don't see instructions you're looking for. I've omitted the ret at the end of each function.

Source + asm from the Godbolt compiler explorer, for the System V calling convention (args in RDI, RSI, RDX, ...):

#include <stdint.h>

uint64_t zext(uint32_t a) { return a; }
uint64_t extract_low(uint64_t a) { return a & 0xFFFFFFFF; }
    # both compile to
    mov     eax, edi

int use_as_index(int *p, unsigned a) { return p[a]; }
   # gcc
    mov     esi, esi         # missed optimization: mov same,same can't be eliminated on Intel
    mov     eax, DWORD PTR [rdi+rsi*4]

   # clang
    mov     eax, esi         # with signed int a, we'd get movsxd
    mov     eax, dword ptr [rdi + 4*rax]


uint64_t zext_load(uint32_t *p) { return *p; }
    mov     eax, DWORD PTR [rdi]

uint64_t zext_add_result(unsigned a, unsigned b) { return a+b; }
    lea     eax, [rdi+rsi]

The default address-size is 64 in x86-64. High garbage doesn't affect the low bits of addition, so this saves a byte vs. lea eax, [edi+esi] which needs a 67 address-size prefix but gives identical results for every input. Of course, add edi, esi would produce a zero-extended result in RDI.
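For reference, one possible encoding of each (byte counts as I read them):

       8D 04 37        lea  eax, [rdi+rsi]     ; 3 bytes: opcode, ModRM, SIB
    67 8D 04 37        lea  eax, [edi+esi]     ; 4 bytes: extra 67 address-size prefix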

uint64_t zext_mul_result(unsigned a, unsigned b) { return a*b; }
   # gcc8.1
    mov     eax, edi
    imul    eax, esi

   # clang6.0
    imul    edi, esi
    mov     rax, rdi    # silly: mov eax,edi would save a byte here

Intel recommends destroying the result of a mov right away when you have the choice, freeing the microarchitectural resources that mov-elimination takes up and increasing the success-rate of mov-elimination (which isn't 100% on Sandybridge-family, unlike AMD Ryzen). GCC's choice of mov / imul is best.

Also, on CPUs without mov-elimination, the mov before imul might not be on the critical path if it's the other input that's not ready yet (i.e. if the critical path goes through the input that doesn't get moved). But mov after imul depends on both inputs so it's always on the critical path.

Of course, when these functions inline, the compiler will usually know the full state of registers, unless they come from function return values. It also doesn't need to produce the result in a specific register (the RAX return value). But if your source is sloppy with mixing unsigned with size_t or uint64_t, the compiler might be forced to emit instructions to truncate 64-bit values. (Looking at compiler asm output is a good way to catch that and figure out how to tweak the source to let the compiler save instructions.)
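For instance, a small sketch of that kind of source-level difference (hypothetical function names):

    #include <stddef.h>

    int idx_unsigned(int *p, unsigned i) { return p[i]; }  // compiler must zero-extend i before indexing
    int idx_size_t  (int *p, size_t   i) { return p[i]; }  // index already 64-bit: mov eax, [rdi+rsi*4]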


Footnote 1: Fun fact: AT&T syntax (which uses different mnemonics, like movswl (sign-extend word->long, i.e. dword) or movzbl) can infer the destination size from the register, like movzb %al, %ecx, but won't assemble movz %al, %ecx even though there's no ambiguity. So it treats movzb as its own mnemonic, with the usual operand-size suffix which can be inferred or explicit. This means each different opcode has its own mnemonic in AT&T syntax.
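A few examples of those AT&T mnemonics (GAS syntax, with the Intel-syntax equivalent in comments):

    movzbl  %al,  %ecx       # movzx ecx, al   (byte  -> dword, zero-extend)
    movzbq  %al,  %rcx       # movzx rcx, al   (byte  -> qword, zero-extend)
    movswl  %ax,  %ecx       # movsx ecx, ax   (word  -> dword, sign-extend)
    movslq  %eax, %rcx       # movsxd rcx, eax (dword -> qword, sign-extend)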

See also assembly cltq and movslq difference for a history lesson on redundancy between CDQE for EAX->RAX and MOVSXD for any registers. See What does cltq do in assembly? or the GAS docs for the AT&T vs. Intel mnemonics for zero/sign-extension.

Footnote 2: Silly computer tricks with movsxd ax, [rsi]:

Assemblers refuse to assemble movsxd eax, eax or movsxd ax, eax, but it is possible to manually encode it. ndisasm doesn't even disassemble it (just db 0x63), but GNU objdump does. Actual CPUs decode it, too. I tried on Skylake just to make sure:

 ; NASM source                           ; register value after stepi in GDB
mov     rdx, 0x8081828384858687
movsxd  rax, edx                         ; RAX = 0xffffffff84858687
db 0x63, 0xc2        ;movsxd  eax, edx   ; RAX = 0x0000000084858687
xor     eax,eax                          ; RAX = 0
db 0x66, 0x63, 0xc2  ;movsxd  ax, edx    ; RAX = 0x0000000000008687

So how does the CPU handle it internally? Does it actually read 32 bits and then truncate to the operand-size? It turns out Intel's ISA reference manual documents the 16-bit form as 63 /r MOVSXD r16, r/m16, so movsxd ax, [unmapped_page - 2] does not fault. (But it incorrectly documents the non-REX forms as valid in compat / legacy mode; actually 0x63 decodes as ARPL there. This is not the first bug in Intel's manuals.)

This makes perfect sense: the hardware can simply decode it to the same uop as mov r16, r/m16 or mov r32, r/m32 when there's no REX.W prefix. Or not! Skylake's movsxd eax,edx (but not movsxd rax, edx) has an output dependency on the destination register, like it's merging into the destination! A loop with times 4 db 0x63, 0xc2 ; movsxd eax, edx runs at 4 clocks per iteration (1 per movsxd, so 1 cycle latency). The uops are fairly evenly distributed to all 4 integer ALU execution ports. A loop with movsxd eax,edx / movsxd ebx,edx / 2 other destinations runs at ~1.4 clocks per iteration (just slightly worse than the 1.25 clocks per iteration front-end bottleneck if you use plain 4x mov eax, edx or 4x movsxd rax, edx). Timed with perf on Linux on i7-6700k.
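A minimal sketch of the kind of dependency-chain loop described above (NASM; the counter setup and iteration count are assumed):

    mov     ecx, 100000000
    mov     edx, 12345
top:
    times 4 db 0x63, 0xc2        ; movsxd eax, edx without REX.W, hand-encoded
    dec     ecx
    jnz     top                  ; ~4 cycles per iteration on Skylake = 1 cycle latency per movsxd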

We know that movsxd eax, edx does zero the upper bits of RAX, so it's not actually using any bits from the destination register it's waiting for, but presumably treating 16 and 32-bit similarly internally simplifies decoding, and simplifies handling of this corner case encoding that nobody should ever use. The 16-bit form always has to actually merge into the destination, so it does have a true dependency on the output reg. (Skylake doesn't rename 16-bit regs separately from full registers.)

GNU binutils is disassembling it incorrectly: gdb and objdump show the source operand as 32 bits, like

  4000c8:       66 63 c2                movsxd ax,edx
  4000cb:       66 63 06                movsxd ax,DWORD PTR [rsi]

when it should be

  4000c8:       66 63 c2                movsxd ax,dx
  4000cb:       66 63 06                movsxd ax,WORD PTR [rsi]

In AT&T syntax, objdump amusingly still uses movslq. So I guess it treats that as a whole mnemonic, not as a movsl instruction with a q operand-size. Or that's just the result of nobody caring about that special case that gas won't assemble anyway (it rejects movsll, and checks register widths for movslq).

Before checking the manual, I actually tested on Skylake with NASM to see if a load would fault or not. It of course does not:

section .bss
    align 4096
    resb 4096
unmapped_page: 
 ; When built into a static executable, this page is followed by an unmapped page on my system,
 ; so I didn't have to do anything more complicated like call mmap

 ...
_start:
    lea     rsi, [unmapped_page-2]
    db 0x66, 0x63, 0x06  ;movsxd  ax, [rsi].  Runs without faulting on Skylake!  Hardware only does a 2-byte load

    o16 movsxd  rax, dword [rsi]  ; REX.W prefix takes precedence over o16 (0x66 prefix); this faults
    mov      eax, [rsi]            ; definitely faults if [rsi+2] isn't readable

Note that movsx al, ax isn't possible: byte operand-size needs a separate opcode. Prefixes only select between 32 (default), 16-bit (0x66) and in long mode 64-bit (REX.W). movs/zx ax, word [mem] has been possible since 386, but reading a source wider than the destination is a corner case that's new in x86-64, and only for sign-extension. (And it turns out that the 16-bit destination encoding actually only reads a 16-bit source.)


Other ISA-design possibilities that AMD chose not to pursue:

BTW, AMD could have (but didn't) design AMD64 to always sign-extend instead of always zero-extend on 32-bit register writes. It would have been less convenient for software in most cases, and would probably also have taken a few extra transistors, but it would still avoid false dependencies on the old value that was sitting around in a register. It might add an extra gate delay somewhere because the upper bits of the result depend on the low bits, unlike zero-extension where they only depend on the fact that it's a 32-bit operation. (But that's probably unimportant.)

If AMD had designed it that way, they'd have needed a movzxd instead of movsxd. I think the major downside to this design would be needing extra instructions when packing bitfields into a wider register. Free zero extension is handy for shl rax,32 / or rax, rdx after a rdtsc that writes edx and eax, for example. If it were sign-extension, you'd need an extra instruction to zero the upper 32 bits of rdx before the or.
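For illustration, the conventional high:low packing after rdtsc, relying on the same free zero-extension (a minimal sketch):

    rdtsc                        ; high half -> EDX, low half -> EAX; both writes zero the upper 32 bits
    shl     rdx, 32              ; move the high half into bits 63..32
    or      rax, rdx             ; RAX = full 64-bit timestamp; no masking needed because
                                 ; RAX's upper 32 bits are already zero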


Other ISAs have made different choices: MIPS III (in ~1995) extended the architecture to 64 bits without introducing a new mode. Very unlike x86, there was enough opcode space left unused in the fixed-width 32-bit instruction word format.

MIPS started out as a 32-bit architecture, and never had any legacy partial-register stuff the way 32-bit x86 did from its 16-bit 8086 heritage, and from 8086's full support of 8-bit operand-size with AX = AH:AL partial regs and so on for easy porting of 8080 source code.

MIPS 32-bit arithmetic instructions like addu on 64-bit CPUs require their inputs to be correctly sign-extended, and produce sign-extended outputs. (Everything just works when running legacy 32-bit code unaware of the wider registers, because shifts are special.)

ADDU rd, rs, rt (from the MIPS III manual, page A-31)

Restrictions:
On 64-bit processors, if either GPR rt or GPR rs do not contain sign-extended 32-bit values (bits 63..31 equal), then the result of the operation is undefined.

Operation:

  if (NotWordValue(GPR[rs]) or NotWordValue(GPR[rt])) then UndefinedResult() endif
  temp ← GPR[rs] + GPR[rt]
  GPR[rd] ← sign_extend(temp[31..0])

(Note that U for unsigned in addu is really a misnomer, as the manual points out. You use it for signed arithmetic too unless you actually want add to trap on signed overflow.)

There's a DADDU instruction for double-word ADDU, which does what you'd expect. Similarly DDIV/DMULT/DSUBU, and DSLL and other shifts.

Bitwise operations stay the same: the existing AND opcode simply becomes a 64-bit AND; there's no need for a new opcode, but there's also no free sign-extension of 32-bit AND results.

MIPS 32-bit shifts are special (SLL is a 32-bit shift. DSLL is a separate instruction).

SLL Shift Word Left Logical

Operation:

s ← sa
temp ← GPR[rt][(31-s)..0] || 0^s
GPR[rd] ← sign_extend(temp)

Programming Notes:
Unlike nearly all other word operations the input operand does not have to be a properly sign-extended word value to produce a valid sign-extended 32-bit result. The result word is always sign extended into a 64-bit destination register; this instruction with a zero shift amount truncates a 64-bit value to 32 bits and sign extends it.

I think SPARC64 and PowerPC64 are similar to MIPS64 in maintaining sign-extension of narrow results. Code-gen for (a & 0x80000000) +- 12315 for int a (with -fwrapv so compilers can't assume that a is non-negative because of signed-overflow UB) shows clang for PowerPC64 maintaining or redoing sign extension, and clang -target sparc64 ANDing then ORing to ensure that only the right bits in the low 32 are set, again maintaining sign-extension. Changing the return type or arg type to long or adding L suffixes on the AND mask constant results in code differences for MIPS64 and PowerPC64 and sometimes SPARC64; maybe only MIPS64 actually faults on 32-bit instructions with inputs that aren't correctly sign-extended, while on others it's just a software calling-convention requirement.

But AArch64 takes an approach more like x86-64, with w0..31 registers being the low half of x0..31, and instructions available in two operand-sizes.

This whole section about MIPS has nothing to do with x86-64, but it's an interesting comparison to look at the different (better IMO) design decision made by AMD64.

I included MIPS64 compiler output in the Godbolt link above, for those sample functions. (And a few others that tell us more about the calling convention, and what compilers do.) It often needs dext to zero-extend from 32 to 64 bits, but that instruction wasn't added until mips64r2. With -march=mips3, return p[a] for unsigned a has to use two doubleword shifts (left then right by 32 bits) to zero extend! (See the sketch just below.) It also needs an extra instruction to zero-extend add results, i.e. to implement casting from unsigned to uint64_t.
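A sketch of that two-shift zero-extension (MIPS assembly; the register choices are assumed):

    # zero-extend the 32-bit value in $a1 into $v0 without dext (pre-mips64r2)
    dsll32  $v0, $a1, 0          # shift left by 32: old bits 31..0 now in 63..32, zeros below
    dsrl32  $v0, $v0, 0          # logical shift right by 32: zeros fill bits 63..32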

So I think we can be glad that x86-64 was designed with free zero-extension instead of only providing 64-bit operand size for some things. (Like I said, x86's heritage is very different; it already had variable operand sizes for the same opcode using prefixes.) Of course, better bitfield instructions would be nice. Some other ISAs, like ARM and PowerPC, put x86 to shame for efficient bit-field insert / extract.

answered by Peter Cordes