
Why doesn't GCC use partial registers?

Disassembling write(1,"hi",3) on linux, built with gcc -s -nostdlib -nostartfiles -O3 results in:

    ba03000000     mov edx, 3    ; thanks for the correction jester!
    bf01000000     mov edi, 1
    31c0           xor eax, eax
    e9d8ffffff     jmp loc.imp.write

I'm not into compiler development, but since every value moved into these registers is constant and known at compile time, I'm curious why gcc doesn't use dl, dil, and al instead. Some may argue that this won't make any difference in performance, but there's a big difference in executable size between mov $1, %eax => b801000000 and mov $1, %al => b001 when we are talking about thousands of register accesses in a program. Not only is small size part of a program's elegance, it also has an effect on performance.

Can someone explain why "GCC decided" that it doesn't matter?

Ábrahám Endre asked Jan 10 '17 16:01


2 Answers

Yes, GCC generally avoids writing to partial registers, unless optimizing for size (-Os) instead of purely speed (-O3). Some cases require writing at least the 32-bit register for correctness, so a better example would be something like:

char foo(char *p) { return *p; } compiles to movzx eax, byte ptr [rdi]
instead of mov al, [rdi]. https://godbolt.org/z/4ca9cTG9j

But GCC doesn't always avoid partial registers, and sometimes it even causes partial-register stalls: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=15533


Writing partial registers entails a performance penalty on many x86 processors because they are renamed into different physical registers from their whole counterpart when written. (For more about register renaming enabling out-of-order execution, see this Q&A).

But when an instruction reads the whole register, the CPU has to detect the fact that it doesn't have the correct architectural register value available in a single physical register. (This happens in the issue/rename stage, as the CPU prepares to send the uop into the out-of-order scheduler.)

It's called a partial register stall. Agner Fog's microarchitecture manual explains it pretty well:

6.8 Partial register stalls (PPro/PII/PIII and early Pentium-M)

Partial register stall is a problem that occurs when we write to part of a 32-bit register and later read from the whole register or a bigger part of it.
Example:

    ; Example 6.10a. Partial register stall
    mov al, byte ptr [mem8]
    mov ebx, eax             ; Partial register stall

This gives a delay of 5 - 6 clocks. The reason is that a temporary register has been assigned to AL to make it independent of AH. The execution unit has to wait until the write to AL has retired before it is possible to combine the value from AL with the value of the rest of EAX.

Behaviour in different CPUs:

  • Intel early P6 family: see above: stall for 5-6 clocks until the partial writes retire.
  • Intel Pentium-M (model D) / Core2 / Nehalem: stall for 2-3 cycles while inserting a merging uop. (see this Q&A for a microbenchmark writing AX and reading EAX with or without xor-zeroing first)
  • Intel Sandybridge: insert a merging uop for low8/low16 (AL/AX) without stalling, or for AH/BH/CH/DH while stalling for 1 cycle.
  • Intel IvyBridge (maybe), but definitely Haswell / Skylake: AL/AX aren't renamed, but AH still is. See "How exactly do partial registers on Haswell/Skylake perform? Writing AL seems to have a false dependency on RAX, and AH is inconsistent."
  • All other x86 CPUs: Intel Pentium4, Atom / Silvermont / Knight's Landing. All AMD (and Via, etc):

Partial registers are never renamed. Writing a partial register merges into the full register, making the write depend on the old value of the full register as an input.

Without partial-register renaming, the input dependency for the write is a false dependency if you never read the full register. This limits instruction-level parallelism because reusing an 8 or 16-bit register for something else is not actually independent from the CPU's point of view (16-bit code can access 32-bit registers, so it has to maintain correct values in the upper halves). It also makes AL and AH not independent. When Intel designed the P6 family (PPro released in 1995), 16-bit code was still common, so partial-register renaming was an important feature to make existing machine code run faster. (In practice, many binaries don't get recompiled for new CPUs.)

That's why compilers mostly avoid writing partial registers. They use movzx / movsx whenever possible to zero- or sign-extend narrow values to a full register to avoid partial-register false dependencies (AMD) or stalls (Intel P6-family). Thus most modern machine code doesn't benefit much from partial-register renaming, which is why recent Intel CPUs are simplifying their partial-register renaming logic.

As @BeeOnRope's answer points out, compilers still read partial registers, because that's not a problem. (Reading AH/BH/CH/DH can add an extra cycle of latency on Haswell/Skylake, though, see the earlier link about partial registers on recent members of Sandybridge-family.)


Also note that write takes arguments that, for a typically configured x86-64 GCC, need whole 32-bit and 64-bit registers, so the call couldn't simply be assembled with mov dl, 3. The size is determined by the type of the data, not the value of the data.

Only 32-bit register writes implicitly zero-extend to the full 64 bits; writing an 8-bit or 16-bit partial register leaves the upper bytes unchanged. (This makes it tricky for hardware to handle efficiently, which is why AMD64 didn't follow that pattern.)
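As a rough illustration of these write rules, the following C sketch models a 64-bit register as a uint64_t. The function names are hypothetical, and the code only models the architectural result of each write; it says nothing about how a real CPU implements it.

```c
#include <stdint.h>

/* Model of "mov dl, imm8": only bits 0-7 change; bits 8-63 keep the
   old value, so the result depends on old_reg. */
uint64_t write_low8(uint64_t old_reg, uint8_t value)
{
    return (old_reg & ~(uint64_t)0xFF) | value;
}

/* Model of "mov dx, imm16": only bits 0-15 change. */
uint64_t write_low16(uint64_t old_reg, uint16_t value)
{
    return (old_reg & ~(uint64_t)0xFFFF) | value;
}

/* Model of "mov edx, imm32": bits 32-63 are zeroed, so the result
   does not depend on old_reg at all. */
uint64_t write_low32(uint64_t old_reg, uint32_t value)
{
    (void)old_reg; /* deliberately unused: no dependency on the old value */
    return value;
}
```

With old_reg = 0xDEADBEEFCAFEF00D, write_low8(old_reg, 3) yields 0xDEADBEEFCAFEF003, which a callee reading the full register as a size_t would not see as 3, while write_low32(old_reg, 3) yields exactly 3. The old_reg input of the narrow writes is also the dependency on the previous register value that non-renaming CPUs have to honor.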

Finally, in certain contexts, C has default argument promotions to be aware of, though that is not the case here.
Actually, as RossRidge pointed out, the call was probably made without a visible prototype.
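For readers unfamiliar with default argument promotions, here is a minimal sketch in C. The helper first_vararg is hypothetical and exists only for illustration; it uses a variable argument list, which is one of the contexts (along with calls made without a visible prototype) where the promotions apply.

```c
#include <stdarg.h>

/* Hypothetical helper: returns the first variadic argument.
   Arguments narrower than int (char, short) are promoted to int by
   the caller, so va_arg has to read an int, not a char. */
int first_vararg(int count, ...)
{
    va_list ap;
    int value;

    va_start(ap, count);
    value = va_arg(ap, int);
    va_end(ap);
    return value;
}
```

A call such as first_vararg(1, (char)3) passes the char as a full int, so the caller has to materialize at least a 32-bit value even though the source type is 8 bits wide.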


Your disassembly is misleading, as @Jester pointed out.
For example mov rdx, 3 is actually mov edx, 3, although both have the same effect—that is, to put 3 in the whole rdx.
This is true because an immediate value of 3 doesn't require sign-extension and a MOV r32, imm32 implicitly clears the upper 32 bits of the register.
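The equivalence only holds for immediates that are non-negative as 32-bit values. A small C model of the two encodings makes that visible; the function names are hypothetical, and the 64-bit form modeled here is the one that takes a sign-extended 32-bit immediate.

```c
#include <stdint.h>

/* Model of "mov r32, imm32": the 32-bit value is zero-extended
   into the full 64-bit register. */
uint64_t mov_r32_imm32(int32_t imm)
{
    return (uint64_t)(uint32_t)imm;
}

/* Model of "mov r64, imm32" (sign-extended immediate form):
   the 32-bit immediate is sign-extended to 64 bits. */
uint64_t mov_r64_simm32(int32_t imm)
{
    return (uint64_t)(int64_t)imm;
}
```

For imm = 3 both functions return 3, which is why the shorter mov edx, 3 can stand in for mov rdx, 3. For imm = -1 they differ: the 32-bit form gives 0x00000000FFFFFFFF and the 64-bit form gives 0xFFFFFFFFFFFFFFFF.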

Margaret Bloom answered Sep 27 '22 16:09


All three of the earlier answers are wrong in different ways.

The accepted answer by Margaret Bloom implies that partial register stalls are to blame. Partial register stalls are a real thing, but are unlikely to be relevant to GCC's decision here.

If GCC replaced mov edx,3 by mov dl,3, then the code would just be wrong, because writes to byte registers (unlike writes to dword registers) don't zero the rest of the register. The parameter in rdx is of type size_t, which is 64 bits, so the callee will read the full register, which will contain garbage in bits 8 to 63. Partial register stalls are purely a performance issue; it doesn't matter how fast the code runs if it's wrong.

That bug could be fixed by inserting xor edx,edx before mov dl,3. With that fix, there is no partial register stall, because zeroing a full register with xor or sub and then writing to the low byte is special-cased in all CPUs that have the stalling problem. So partial register stalls are still irrelevant with the fix.

The only situation where partial register stalls would become relevant is if GCC happened to know that the register was zero, but it wasn't zeroed by one of the special-cased instructions. For example, if this syscall was preceded by

    loop:
      ...
      dec edx
      jnz loop

then GCC could deduce that rdx was zero at the point where it wants to put 3 in it, and mov dl,3 would be correct – but it would be a bad idea in general because it could cause a partial-register stall. (Here, it wouldn't matter because syscalls are so slow anyway, but I don't think GCC has a "slow function that there's no need to speed-optimize calls to" attribute in its internal type system.)


Why doesn't GCC emit xor followed by a byte move, if not because of partial register stalls? I don't know but I can speculate.

It only saves space when initializing r0 through r3 (rax, rcx, rdx, rbx, whose low bytes can be written without a REX prefix), and even then it only saves one byte. It increases the number of instructions, which has its own costs (the instruction decoders are frequently a bottleneck). It also clobbers the flags, unlike the standard mov, which means it isn't a drop-in replacement. GCC would have to track a separate flag-clobbering register initialization sequence, which in most cases (11/15 of possible destination registers) would be unambiguously less efficient.

If you're aggressively optimizing for size, you can do push 3 followed by pop rdx, which saves 2 bytes regardless of the destination register, and doesn't clobber the flags. But it is probably much slower because it writes to memory and has a false read-write dependence on rsp, and the space savings seem unlikely to be worth it. (It also modifies the red zone, so it isn't a drop-in replacement either.)
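The byte counts for the three ways of putting 3 in rdx can be tallied with a short C sketch. The encodings below were written out by hand for this illustration (the mov edx, 3 bytes match the disassembly in the question), so treat them as an assumption worth checking against an assembler; the function names are hypothetical.

```c
#include <stddef.h>

/* Machine-code bytes for each way of putting 3 in rdx. */
static const unsigned char mov_edx_3[]   = { 0xBA, 0x03, 0x00, 0x00, 0x00 }; /* mov edx, 3   */
static const unsigned char xor_edx_edx[] = { 0x31, 0xD2 };                   /* xor edx, edx */
static const unsigned char mov_dl_3[]    = { 0xB2, 0x03 };                   /* mov dl, 3    */
static const unsigned char push_3[]      = { 0x6A, 0x03 };                   /* push 3       */
static const unsigned char pop_rdx[]     = { 0x5A };                         /* pop rdx      */

/* Size in bytes of the plain 32-bit move. */
size_t size_plain_mov(void)    { return sizeof mov_edx_3; }

/* Size in bytes of xor-zeroing followed by a byte move. */
size_t size_xor_byte_mov(void) { return sizeof xor_edx_edx + sizeof mov_dl_3; }

/* Size in bytes of the push/pop pair. */
size_t size_push_pop(void)     { return sizeof push_3 + sizeof pop_rdx; }
```

The totals are 5, 4, and 3 bytes respectively: one byte saved by the xor sequence and two by push/pop, at the cost of an extra instruction in both cases.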


supercat's answer says

Processor cores often include logic to execute multiple 32-bit or 64-bit instructions simultaneously, but may not include logic to execute an 8-bit operation simultaneously with anything else. Consequently, while using 8-bit operations on the 8088 when possible was a useful optimization on the 8088, it can actually be a significant performance drain on newer processors.

Modern optimizing compilers actually use 8-bit GPRs quite a lot. (They use 16-bit GPRs relatively rarely, but I think that's because 16-bit quantities are uncommon in modern code.) 8-bit and 16-bit operations are at least as fast as 32-bit and 64-bit operations at most execution stages, and some are faster.

I previously wrote here "As far as I know, 8-bit operations are as fast as, or faster than, 32/64-bit operations on absolutely every 32/64 bit x86/x64 processor ever made." But I was wrong. Quite a few superscalar x86/x64 processors merge 8- and 16-bit destinations into the full register on every write, which means that write-only instructions like mov have a false read dependency when the destination is 8/16 bits which doesn't exist when it's 32/64 bits. False dependency chains can slow execution if you don't clear the register before every move (or during, using something like movzx). Newer processors have this problem even though the earliest superscalar processors (Pentium Pro/II/III) didn't have it. In spite of that, modern optimizing compilers do use the smaller registers in my experience.


BeeOnRope's answer says

The short answer for your particular case, is because gcc always sign or zero-extends arguments to 32-bits when calling a C ABI function.

But this function has no parameters shorter than 32 bits in the first place. File descriptors are exactly 32 bits long, and size_t is exactly 64 bits long. It doesn't matter that many of those bits are often zero. They aren't variable-length integers that are encoded in 1 byte if they're small. It would only be correct to use mov dl,3, with the rest of rdx possibly being nonzero, for a parameter if there was no integer promotion requirement in the ABI and the actual parameter type was char or some other 8-bit type.

benrg answered Sep 27 '22 15:09