I am looking for the fastest / most space efficient way of reducing a 64-bit register to a 32-bit register, only retaining the zero / non-zero status of the 64-bit register.
My current best idea that works for all values is popcntq (1c throughput, 3c latency on mainstream Intel, 5-byte code size):
    # rax is either zero or non-zero
    popcntq %rax, %rax    # eax will be zero if rax was zero, otherwise it will be non-zero
NOTE: It will not work to just use the 32-bit eax directly: if rax was, say, 2^61, the zero / non-zero status of eax is not the same as that of rax.
Is there some better clever method?
Fewest uops (front-end bandwidth):
1 uop, latency 3c (Intel) or 1c (Zen).
Also smallest code-size, 5 bytes.
    popcnt %rax, %rax    # 5 bytes, 1 uop for one port
                         # if using a different dst, note the output dependency on Intel before ICL
On most CPUs that have it at all, it's 3c latency, 1c throughput (1 uop for one port).
Or 1c latency on Zen 1/2/3 with 0.25c throughput. (https://agner.org/optimize/ & https://uops.info/)
On Bulldozer-family before Excavator, popcnt r64 is 4c latency, 4c throughput. (32-bit operand-size has 2c throughput but still 4c latency.) Bobcat has quite slow microcoded popcnt.
Lowest latency (assuming Haswell or newer so there's no partial-register effect when writing AL and then reading EAX, or a uarch with no P6 ancestry that doesn't rename partial regs):
2 cycle latency, 2 uops, 6 bytes. Also the smallest code-size if popcnt (5B) isn't available.
    test %rax, %rax    # 3B, 1 uop any ALU port
    setnz %al          # 3B, 1 uop p06 (Haswell .. Ice Lake)
                       # only works in-place, with the rest of EAX already being known 0 for RAX==0
AL is the low byte of EAX, so AL=1 definitely makes EAX non-zero for any non-zero RAX.
This will cost a merging uop when reading EAX on Sandybridge / Ivy Bridge. Core 2 / Nehalem will stall for a couple cycles to insert that merging uop. Earlier P6-family like Pentium M will fully stall until the setcc retires if a later instruction reads EAX. (Why doesn't GCC use partial registers?)
Nate's neg/sbb is about the same as this on Broadwell and later, but 1 byte shorter. (And it does zero the upper 32.) This is better on Haswell, where sbb is 2 uops. On earlier mainstream Intel CPUs they both require 3 uops: this one needs a merging uop when EAX is read, while sbb (other than sbb $0 on SnB/HSW) always requires 2 uops. neg/sbb can produce its result in a different register (still destroying the input), but has a false dependency on CPUs other than AMD. (K8/K10, Bulldozer-family, and Zen-family all recognize sbb same,same as depending only on CF.)
If you want the upper 32 zeroed, BMI2 RORX to copy-and-shift:
2 uops, 2c latency, 8 bytes
    rorx $32, %rax, %rdx    # 6 bytes, 1 uop, 1c latency
    or   %edx, %eax         # 2 bytes, 1c latency
    ## can produce its result in a different register without a false dependency.
rorx $32 is useful in general for horizontal SWAR reductions, e.g. for a dword horizontal sum you could movq a pair of dwords out of an XMM register and do the last shuffle+add in scalar using rorx/add instead of pshufd/paddd.
Or without BMI2 while still zeroing the upper 32:
7 bytes, 4 uops, 3c latency on Intel (where bswap r64 is 2 uops, 2c latency); otherwise 3 uops, 2c latency on CPUs with efficient bswap r64, like Zen-family and Silvermont-family.
    mov   %eax, %edx    # 2 bytes, and not on the critical path
    bswap %rax          # 3 bytes, vs. 4 for shr $32, %rax
    or    %edx, %eax    # 2 bytes
    ## can write a different destination
A compromise using shr $32, %rax in place of bswap is 8 bytes, 3 uops, 2c latency for the above, even without mov-elimination. Running an ALU instruction on the original register instead of the mov result lets a non-eliminated MOV run in parallel with it.
Ideas that didn't pan out:
    bsf %rax, %rax    # 4 bytes. Fails for RAX=1
It leaves the destination unmodified for input=0 (AMD documents this; Intel implements it but doesn't document it as future-proof). Unfortunately it produces RAX=0 for an input of 1. It has the same perf as popcnt on Intel, and worse on AMD, but does save 1 byte of code size.
(Using sub $1 to set CF and then ??; Nate's usage of neg is how to make that work cleanly.)
I didn't try using a superoptimizer to brute-force check for other possible sequences, but as Nate commented this is a short enough problem to be a use-case for one.