Microarchitectural zeroing of a register via the register renamer: performance versus a mov?

Question

I read on a blog post that recent X86 microarchitectures are also able to handle common register zeroing idioms (such as xor-ing a register with itself) in the register renamer; in the words of the author:

"the register renamer also knows how to execute these instructions – it can zero the registers itself."

Does anybody know how this works in practice? I know that some ISAs, like MIPS, contain an architectural register that is always set to zero in hardware; does this mean that internally, the X86 microarchitecture has similar "zero" registers internally that registers are mapped to when convenient? Or is my mental model not quite correct on how this stuff works microarchitecturally?

The reason why I am asking is because (from some observation) it seems that a mov from one register containing zero to a destination, in a loop, is still substantially faster than zeroing the register via xor within the loop.

Basically what it happening is that I would like to zero a register within a loop depending on a condition; this can either be done by allocating an architectural register ahead of time to store zero (%xmm3, in this case), which is not modified for the entire duration of the loop, and executing the following within it:

movapd  %xmm3, %xmm0

or instead with the xor trick:

xorpd   %xmm0, %xmm0

(Both AT&T syntax).

In other words choice is between hoisting a constant zero outside of the loop or rematerializing it within it for each iteration. The latter reduces the number of live architectural registers by one, and, with the supposed special case awareness and handling of the xor idiom by the processor, it seems like it ought to be as fast as the former (especially since these machines have more physical registers than architectural registers anyway, so it should be able to internally do the equivalent to what I've done in the assembly by hoisting out the constant zero or even better, internally, with full awareness and control over its own resources). But it doesn't seem to be, so I'm curious if anyone with CPU architecture knowledge can explain if there's a good theoretical reason for that.

_{The registers in this case happen to by SSE registers and the machine happens to be Ivy Bridge; I'm not sure how important either of those factors are.}

Raymond Hettinger · Accepted Answer

Executive summary: You can run up to four xor ax, ax instructions per cycle as compared to the slower mov immediate, reg instructions.

Details and references:

Wikipedia has a nice overview of register renaming in general: http://en.wikipedia.org/wiki/Register_renaming

Torbj¨orn Granlund's timings for instruction latencies and throughput for AMD and Intel x86 processors are at: http://gmplib.org/~tege/x86-timing.pdf

Agner Fog nicely covers the specifics in his Micro-architecture study:

8.8 Register allocation and renaming

Register renaming is controlled by the register alias table (RAT) and the reorder buffer (ROB) ... The µops from the decoders and the stack engine go to the RAT via a queue and then to the ROB-read and the reservation station. The RAT can handle 4 µops per clock cycle. The RAT can rename four registers per clock cycle, and it can even rename the same register four times in one clock cycle.

Special cases of independence

A common way of setting a register to zero is by XOR'ing it with itself or subtracting it from itself, e.g. XOR EAX,EAX. The Sandy Bridge processor recognizes that certain instructions are independent of the prior value of the register if the two operand registers are the same. This register is set to zero at the rename stage without using any execution unit. This applies to all of the following instructions: XOR, SUB, PXOR, XORPS, XORPD, VXORPS, VXORPD and all variants of PSUBxxx and PCMPGTxx, but not PANDN etc.

Instructions that need no execution unit

The abovementioned special cases where registers are set to zero by instructions such as XOR EAX,EAX are handled at the register rename/allocate stage without using any execution unit. This makes the use of these zeroing instructions extremely efficient, with a throughput of four zeroing instructons per clock cycle.

Bruce Dawson · Answer

The biggest performance cost in your zeroing is hidden in this sentence:

Basically what it happening is that I would like to zero a register within a loop depending on a condition

That sentence implies a branch. Even if the branch is correctly predicted it will still likely cost more than zeroing a register.

As for register renaming...

In an OutOfOrder (OOO) CPU every time you write to a register the CPU gives you a new register. If you executed these three instructions:

xor eax,eax
add eax,eax
add eax,1

then for the first instruction the CPU (if it's a recent Intel CPU anyway) just updates its mappings to say that eax now refers to an internal zero register. On the first add it reads from eax (twice, since it is used twice as an input) and then updates its mapping to point to a new register and writes the result to that register. The same thing happens with the second add. So, over the course of these three instructions the eax register is changed to point to three different physical registers.

Why? Because of this:

mov eax,[esi]    ; Load from esi
add eax, 1
mov [esi], eax   ; Store to esi
mov eax,[esi+4]  ; Load from esi+4
add eax, 1
mov [esi+4], eax ; Store to esi+4

On an OOO processor one of the main restrictions on performance is dependencies. Instructions one to three must execute in order. Instructions four to six must execute in order. But there are no dependencies between those two blocks. So, one-to-three and four-to-six can execute in parallel. But, they all refer to eax.

No problem. Register renaming solves this. The first and fourth instructions execute simultaneously. The CPU creates a separate mapping for eax for each point in the instruction flow, and subsequent instructions operate on these renamed registers. This allows the two blocks of instructions to execute fully in parallel.

This is actually fiendishly complicated for all sorts of reasons, but it works, and it is one of the main things that allows modern CPUs to run so fast.

Anyway, long-story short, "xor eax,eax" never even executes, and that is cool. That optimization could be applied to any instruction that always produces zero or always produces ones, or whatever, but Intel is only going to spend the transistors to do this when it matters. I guess xorpd doesn't make the cut, yet.

I blogged about this (http://randomascii.wordpress.com/2012/12/29/the-surprising-subtleties-of-zeroing-a-register/) because I thought it was cool. I also liked the idea that 'add' and 'sub', which are mostly identical instructions, can have slightly or vastly different performance because of this behavior, albeit only in the case where a register is subtracted from itself.

Microarchitectural zeroing of a register via the register renamer: performance versus a mov?

Tags:

cpu-architecture

x86

assembly

x86-64

ice-nine

2 Answers

8.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit

Raymond Hettinger

Bruce Dawson

Recent Activity

Donate For Us

Microarchitectural zeroing of a register via the register renamer: performance versus a mov?

Tags:

cpu-architecture

x86

assembly

x86-64

ice-nine

2 Answers

8.8 Register allocation and renaming

Special cases of independence

Instructions that need no execution unit

Raymond Hettinger

Bruce Dawson

Related questions

Recent Activity

Donate For Us