x86_64 best way to reduce 64 bit register to 32 bit retaining zero or non-zero status

I am looking for the fastest / most space efficient way of reducing a 64-bit register to a 32-bit register, only retaining the zero / non-zero status of the 64-bit register.

My current best idea that works for all values is popcntq
(1c tput, 3c latency on mainstream Intel, 5 byte code size):

 // rax is either zero or non-zero
 popcntq  %rax, %rax        // eax will be zero if rax was zero, otherwise it will be non-zero

NOTE: It will not work to just use the 32-bit EAX directly: if RAX was, say, 2^61, the zero / non-zero status of EAX is not the same as that of RAX.

Is there some better clever method?

Asked by Noah, Sep 21 '21 17:09



1 Answer

Fewest uops (front-end bandwidth):
1 uop, latency 3c (Intel) or 1c (Zen).
Also smallest code-size, 5 bytes.

 popcnt  %rax, %rax         # 5 bytes, 1 uop for one port
                            # if using a different dst, note the output dependency on Intel before ICL

On most CPUs that have it at all, it's 3c latency, 1c throughput (1 uop for one port).
Or 1c latency on Zen 1/2/3, with 0.25c throughput. (https://agner.org/optimize/ & https://uops.info/)

On Bulldozer-family before Excavator, popcnt r64 is 4c latency, 4c throughput. (32-bit operand-size has 2c throughput but still 4c latency.) Bobcat has quite slow microcoded popcnt.


Lowest latency (assuming Haswell or newer so there's no partial-register effect when writing AL and then reading EAX, or a uarch with no P6 ancestry that doesn't rename partial regs):
2 cycle latency, 2 uops, 6 bytes. Also the smallest code-size if popcnt (5B) isn't available.

  test  %rax, %rax     # 3B, 1 uop any ALU port
  setnz %al            # 3B, 1 uop p06 (Haswell .. Ice Lake)
                       # only works in-place, with the rest of EAX already being known 0 for RAX==0

AL is the low byte of EAX, so AL=1 definitely makes EAX non-zero for any non-zero RAX.

This will cost a merging uop when reading EAX on Sandybridge/Ivy Bridge. Core2 / Nehalem will stall for a couple cycles to insert that merging uop. Earlier P6-family like Pentium-M will fully stall until the setcc retires if a later instruction reads EAX. (Why doesn't GCC use partial registers?)

Nate's neg/sbb is about the same as this on Broadwell and later, but 1 byte shorter. (And does zero the upper 32). This is better on Haswell, where sbb is 2 uops. On earlier mainstream Intel CPUs, they both require 3 uops, with this one needing a merging uop when EAX is read, or sbb (other than sbb $0 on SnB/HSW) always requiring 2 uops. neg/sbb can be used into a different register (still destroying the input), but has a false dependency on CPUs other than AMD. (K8/K10, Bulldozer-family, and Zen-family all recognize sbb same,same as depending only on CF).


If you want the upper 32 zeroed, BMI2 RORX to copy-and-shift:
2 uops, 2c latency, 8 bytes

 rorx  $32, %rax, %rdx      # 6 bytes, 1 uop, 1c latency
 or    %edx, %eax           # 2 bytes, 1c latency
## can produce its result in a different register without a false dependency.

rorx $32 is useful in general for horizontal SWAR reductions, e.g. for dword horizontal sum, you could movq a pair of dwords out of an XMM register and do the last shuffle+add in scalar using rorx/add instead of pshufd/paddd.

Or without BMI2 while still zeroing the upper 32:
7 bytes; 4 uops, 3c latency on Intel (where bswap r64 is 2 uops, 2c latency), or 3 uops, 2c latency on CPUs with efficient bswap r64, like Zen-family and Silvermont-family.

 mov    %eax, %edx       # 2 bytes, and not on the critical path
 bswap  %rax             # 3 bytes, vs. 4 for  shr $32, %rax
 or     %edx, %eax       # 2 bytes
## can write a different destination

A compromise using shr $32, %rax in place of bswap is
8 bytes, 3 uops, 2c latency, even without mov-elimination.
Running the ALU instruction on the original register instead of on the mov result lets a non-eliminated MOV run in parallel with it.


Background for evaluating "best" performance:

  • How many CPU cycles are needed for each assembly instruction? - that's not how performance works; CPUs don't wait for one instruction to finish before starting the next, and overlap possibilities depend on the details.
  • What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?
  • https://agner.org/optimize/ guides and instruction tables
  • https://uops.info/ machine-generated instruction tables (uops, ports, latency) that are free from typos.
  • https://stackoverflow.com/tags/x86/info other links

Ideas that didn't pan out:

   bsf %rax, %rax    # 4 bytes.  Fails for RAX=1 

Leaves the destination unmodified for input=0. (AMD documents this, Intel implements it but doesn't document it as future-proof.) Unfortunately this produces RAX=0 for an input of 1.
And has the same perf as popcnt on Intel, and worse on AMD, but does save 1 byte of code size.

(Using sub $1 to set CF and then ??; Nate's usage of neg is how to make that work cleanly.)

I didn't try using a superoptimizer to brute-force check for other possible sequences, but as Nate commented this is a short enough problem to be a use-case for one.

Answered by Peter Cordes, Oct 23 '22 02:10