
May there be any penalties when using 64/32-bit registers in Long mode?

This is probably a matter of nano-optimization rather than even micro-optimization, but the subject interests me: are there any penalties when using non-native register sizes in long mode?

I've learned from various sources that partial register updates (like ax instead of eax) can cause eflags stall and degrade performance. But I'm not sure about long mode. What register size is considered native for this processor operating mode? x86-64 is still an extension of the x86 architecture, thus I believe 32 bits are still native. Or am I wrong?

For example, instructions like

sub eax, r14d

or

sub rax, r14

have the same size, but are there any penalties for using either of them? Are there any penalties when mixing register sizes in consecutive instructions like the following? (Assume the high dword is zero in all cases.)

sub ecx, eax
sub r14, rax
Alexander Zhak asked Oct 19 '16 21:10



1 Answer

May there be any penalties when mixing 32 and 64-bit register sizes in consecutive instructions?

No. Writing to a 32-bit register always zero-extends into the full 64-bit register, so x86-64 avoids any partial-register penalties when mixing 32-bit and 64-bit instructions.
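
As a small illustration of the zero-extension rule (a sketch only; the register choices and constants are arbitrary), compare what a 32-bit write and a 16-bit write do to the rest of the register:

```asm
mov   rax, -1          # rax = 0xFFFFFFFFFFFFFFFF
mov   eax, 5           # 32-bit write zero-extends: rax = 0x0000000000000005

mov   rax, -1          # rax = 0xFFFFFFFFFFFFFFFF again
mov   ax, 5            # 16-bit write keeps the upper 48 bits: rax = 0xFFFFFFFFFFFF0005

# the sequence from the question: the 32-bit write to ecx replaces all of rcx,
# and the 64-bit sub just reads the full rax, so nothing has to be merged
sub   ecx, eax
sub   r14, rax
```

Only the 8-bit and 16-bit forms leave part of the old value in place, which is where the partial-register issues discussed further down come from.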

thus I believe 32 bits are still native.

Yes, the default operand-size is 32-bit for most instructions (other than PUSH/POP). 64-bit needs a REX prefix with the W bit set to 1. So prefer 32-bit for code-size reasons. This is why compilers use mov r32, imm32 for addresses of static data (since the default code-model requires that code and static data addresses are in the low 2GiB of virtual address space).
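
To make the code-size point concrete, here are a few instructions with one valid machine-code encoding for each (a sketch; an assembler may pick a different but equivalent encoding, and the immediate value 1 is just a placeholder):

```asm
sub   eax, ecx         # 29 C8                 2 bytes, no prefix needed
sub   rax, rcx         # 48 29 C8              3 bytes, REX.W selects 64-bit operand size

sub   eax, r14d        # 44 29 F0              3 bytes, REX.R is needed just to name r14d
sub   rax, r14         # 4C 29 F0              3 bytes, REX.W and REX.R share the same prefix byte

mov   eax, 1           # B8 01 00 00 00        5 bytes, mov r32, imm32 (zero-extends into rax)
mov   rax, 1           # 48 C7 C0 01 00 00 00  7 bytes, sign-extended imm32 form
```

The two instructions in the question have the same size only because r14d already requires a REX prefix; with only the legacy registers (eax, ecx, ...), the 32-bit form is one byte shorter.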

It was a design choice by AMD. They could have chosen the other way, and required a prefix to get 32-bit operand size. Since long mode is a separate mode, x86-64 machine code can be different from x86-32 machine code however it wants. AMD chose to minimize the differences so they could share as many transistors as possible in the decoders. Your conclusion is correct, but your reasoning is totally bogus.


partial register updates (like ax instead of eax) can cause eflags stall and degrade performance.

Partial-flag stalls are separate from partial-register stalls. They're handled similarly internally (the separately-renamed parts of EFLAGS have to be merged the same as a modified AX has to be merged with the unmodified upper bytes of EAX). But one doesn't cause the other.

# partial-reg stall
setcc   al           # leaves the upper 3 (or 7) bytes unmodified
add     edx, eax     # reads full EAX.  Older CPUs stall while merging

Zeroing EAX with xor eax,eax ahead of the flag-setting instruction and the setcc avoids the partial-register penalty entirely. (Core2/Nehalem stall for fewer cycles than earlier CPUs, but still stall for 2 or 3 cycles while inserting a merging uop. Sandybridge doesn't stall at all while inserting the merging uop.)
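
A minimal sketch of that xor-zeroing pattern (the cmp operands and the setl condition are arbitrary choices for illustration):

```asm
xor     eax, eax     # zero the full register; must come before cmp because xor itself writes flags
cmp     edi, esi     # set flags
setl    al           # write only the low byte; the upper bytes are already known to be zero
add     edx, eax     # reads full EAX without a partial-register stall
```

The ordering matters: placing the xor between the cmp and the setcc would overwrite the flags that setcc is supposed to read.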

(Another summary of partial register penalties on different CPUs: Why doesn't GCC use partial registers?, saying basically the same thing).

AMD doesn't suffer from partial-register stalls when reading the full register later, but instead partial-register writes and reads have a false dependency on the full register. (AMD CPUs don't rename sub-registers separately in the first place. Intel P4 and Silvermont / Knight's Landing are the same way.)

Intel Haswell/Skylake (and maybe Ivybridge) don't rename al separately from rax at all, so they never need to merge low8 / low16 registers. But the setcc al has a false dependency on the old value. They do still rename and merge ah. (Details on HSW/SKL partial-reg performance.)
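
On CPUs where writing a low-8 register has a false dependency on the old value, one common way to sidestep it when loading a byte is to write the whole register with a zero-extending load. A sketch, assuming rdi holds a pointer to the byte and using GAS-style `byte ptr` (NASM would write `byte`):

```asm
mov     al, [rdi]               # writes only AL: has to wait for the old value of RAX on such CPUs
movzx   eax, byte ptr [rdi]     # writes all of RAX (zero-extended), so no dependency on the old value
```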


# partial flag stall when reading a flag that didn't come from
# the last instruction to write any flags.
clc
# edi and esi = one-past-the-end of dst and src
# ecx = -count
bigInt_add:
    mov   eax, [esi+ecx*4]
    adc   [edi+ecx*4], eax   # reads CF, partial flag stall on 2nd and later iterations
    inc   ecx                # writes all flags except CF
    jl    bigInt_add         # loop upwards towards zero

See this Q&A for more discussion about partial-flags issues on Intel pre-Sandybridge vs. Sandybridge.
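
For illustration, one way to keep the loop counter update from touching flags at all is to increment with lea and branch with jecxz. This is only a sketch of the idea, not a tuned loop: it assumes a non-zero count like the loop above, the `done` label is hypothetical, and whether it is actually faster depends on the microarchitecture, so measure before relying on it:

```asm
clc
# edi and esi = one-past-the-end of dst and src
# ecx = -count
bigInt_add:
    mov   eax, [esi+ecx*4]
    adc   [edi+ecx*4], eax   # reads CF, which was last written by the previous adc (or clc)
    lea   ecx, [ecx+1]       # ecx += 1 without writing any flags
    jecxz done               # leave the loop once ecx reaches zero
    jmp   bigInt_add
done:
```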


See also Agner Fog's microarch pdf, and other links in the x86 tag wiki for more details about all of this.

Peter Cordes answered Nov 02 '22 18:11
