Is there any architecture that uses the same register space for scalar integer and floating point operations?

Most architectures I've seen with native scalar hardware FP support shove floating-point values off into a completely separate register space, separate from the main set of integer registers.

  • X86's legacy x87 FPU uses a partially separate floating-point "stack machine" (read: basically a fixed-size 8-item ring buffer) with registers st(0) through st(7) to index each item. This is probably the most different of the popular ones. It can only interact with the integer registers through loads/stores to memory, or by sending compare results to the integer side (fnstsw ax since the 286, and fcomi, which writes EFLAGS directly, since i686).
  • FPU-enabled ARM has a separate FP register space that works similarly to its integer space. The primary difference is a separate instruction set specialized for floating-point, but even the idioms mostly align.
  • MIPS is somewhere in between, in that floating point is technically done through a coprocessor (at least visibly) and it has slightly different rules surrounding usage (like doubles using two floating-point registers rather than single extended registers), but they otherwise work fairly similarly to ARM.
  • X86's newer SSE scalar instructions operate similarly to their vector counterparts, using similar mnemonics and idioms. You can use a 64-bit memory reference as an operand for many scalar operations like addsd xmm1, m64 or subsd xmm1, m64, but data only moves between XMM and general-purpose registers via movq xmm1, r/m64, movq r/m64, xmm1, and friends (a short sketch follows this list). This is similar to ARM64 NEON, although it's slightly different from ARM's standard scalar instruction set.
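
To make the x87-vs-SSE contrast concrete, here's a minimal NASM-style sketch (x86-64, System V ABI; the function f is hypothetical and exists only for illustration):

    ; double f(double x) -- illustrative only
    global f
    section .text
    f:
        ; SSE2 scalar: FP lives in xmm0-xmm15, separate from rax/rbx/...
        addsd   xmm0, xmm0        ; x + x; xmm or m64 sources allowed
        movq    rax, xmm0         ; the only direct xmm <-> GPR path: movq/movd
        movq    xmm1, rax         ; round-trip back; no FP math happens in rax

        ; legacy x87: an 8-entry register stack, reachable only through memory
        sub     rsp, 8
        movsd   [rsp], xmm1
        fld     qword [rsp]       ; push onto st(0); x87 can't read xmm or GPRs
        fstp    qword [rsp]       ; pop st(0) back out to memory
        movsd   xmm0, [rsp]
        add     rsp, 8
        ret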

Conversely, many vector instruction sets don't even bother with the integer/FP distinction, drawing a line only between scalar and vector. For x86, ARM, and MIPS, all three (a short sketch follows the list):

  • They separate the scalar and vector register spaces.
  • They reuse the same register space for vectorized integer and floating-point operations.
  • They can still access the integer registers as applicable.
  • Scalar operations simply pull their scalars from the relevant register space (or memory in the case of x86 FP constants).
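
For x86, a tiny sketch of that register reuse (assuming an SSE2 baseline; the label demo is made up):

    section .data
    align 16
    ints:   dd 1, 2, 3, 4

    section .text
    demo:
        movdqa   xmm0, [ints]     ; 4x int32 in xmm0
        paddd    xmm0, xmm0       ; integer SIMD add: {2, 4, 6, 8}
        cvtdq2ps xmm0, xmm0       ; convert to 4x float, same register
        addps    xmm0, xmm0       ; FP SIMD add on the very same register
        ret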

But I was wondering: are there any CPU architectures that reuse the same register space for integer and floating point operations?

And if not (due to reasons beyond compatibility), what would be preventing hardware designers from choosing to go that route?

asked Jul 23 '18 by Isiah Meadows



2 Answers

The Motorola 88100 had a single register file (thirty-one 32-bit entries plus a hardwired zero register) used for both floating-point and integer values. With 32-bit registers and support for double precision, register pairs had to be used to hold doubles, significantly constraining the number of double-precision values that could be kept in registers.

The follow-on 88110 added thirty-two 80-bit extended registers for additional (and larger) floating point values.

Mitch Alsup, who was involved in Motorola's 88k development, has developed his own load-store ISA (at least partially for didactic reasons) which, if I recall correctly, uses a unified register file.

It should also be noted that the Power ISA (descendant from PowerPC) defines an "Embedded Floating Point Facility" which uses GPRs for floating point values. This reduces core implementation cost and context switch overhead.

One benefit of separate register files is that the split provides explicit banking, reducing the port count of each file in a straightforward limited-superscalar design. E.g., with three read ports on each file, any FP operation (even a three-source-operand FMADD) could start in parallel with any GPR-based operation, as could many common pairs of GPR-based operations; a single unified file would need five read ports to start an FMADD alongside one other two-source operation. Another factor is that the capacity is additional and the width is independent; this has both advantages and disadvantages. In addition, coupling storage with operations lets a highly distinct coprocessor be implemented more straightforwardly. This was more significant for early microprocessors given chip-size limits, but the UltraSPARC T1 shared one floating-point unit among eight cores, and AMD's Bulldozer shared an FP/SIMD unit between two integer "cores".

A unified register file has some calling convention advantages; values can be passed in the same registers regardless of the type of the values. A unified register file also reduces unusable resources by allowing all registers to be used for all operations.

answered Oct 20 '22 by Paul A. Clayton


Historically, of course, the FPU was an optional part of the CPU (so there were versions of a chip with and without the FPU), or an optional separate chip (e.g. 8086 + 8087 / 80286 + 80287 / ...). So it made a ton of sense for the FPU to have its own separate registers.

Leaving out the FPU register file, the FP execution units, and the forwarding network and write-back logic for FP results is exactly what you want when you make an integer-only version of a CPU.

So there has always been historical precedent for having separate FP registers.


But for a blue-sky brand new design, it's an interesting question. If you're going to have an FPU, it must be integrated for good performance when branching on FP comparisons and stuff like that. Sharing the same registers for 64-bit integer / double is totally plausible from a software and hardware perspective.

However, SIMD of some sort is also mandatory for a modern high-performance CPU. CPU-style SIMD (as opposed to the GPU style) is normally done with short fixed-width vector registers, often 16 bytes wide, though recent Intel has widened them to 32 or 64 bytes. Using only the low 8 bytes of those for 64-bit scalar integer registers would leave a lot of wasted space (and maybe wasted power when reading/writing them in integer code).

Of course, moving data between GP integer and SIMD vector registers costs instructions, and sharing a register set between integer and SIMD would be nice for that, if it's worth the hardware cost.


The best case for this would be a hypothetical brand new ISA with a scalar FPU, especially if it's just an FPU and doesn't have integer SIMD. Even in that unlikely case, there are still some reasons:

Instruction encoding space

One significant reason for separate architectural registers is instruction encoding space / bits.

For an instruction to have a choice of 16 registers for each operand, that takes 4 bits per operand. Would you rather have 16 FP and 16 integer registers, or 16 total registers that compete with each other for register-allocation of variables?

FP-heavy code usually needs at least a few integer registers for pointers into arrays, and loop control, so having separate integer regs doesn't mean they're all "wasted" in an FP loop.

I.e. for the same instruction-encoding format, the choice is between N integer + N FP registers vs. N flexible registers, not 2N flexible registers. So you get twice as many registers in total by having them split between FP and int.

32 flexible registers would probably be enough for a lot of code, though, and many real ISAs do have 32 architectural registers (AArch64, MIPS, RISC-V, POWER, many other RISCs). That takes 10 or 15 bits per instruction: 2 or 3 register operands at 5 bits each (like add dst, src or add dst, src1, src2). Having only 16 flexible registers would definitely be worse than having 16 of each, though. In algorithms that use polynomial approximations for functions, you often need a lot of FP constants in registers, and that doesn't leave many for unrolling to hide the latency of FP instructions.
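
As an illustration of that register pressure, here's a hedged NASM-style sketch of a degree-4 polynomial loop via Horner's rule; the coefficients c0..c4 and the argument registers are made up for illustration. Five coefficients already pin down five xmm registers before any unrolling:

    section .data
    align 8
    c4: dq 0.25
    c3: dq -0.5
    c2: dq 1.0
    c1: dq -2.0
    c0: dq 4.0

    section .text
    ; rdi = input array, rsi = output array, rdx = element count (assumed)
    poly_loop:
        movsd   xmm2, [c4]        ; keep all 5 coefficients live in registers
        movsd   xmm3, [c3]        ;   across the whole loop
        movsd   xmm4, [c2]
        movsd   xmm5, [c1]
        movsd   xmm6, [c0]
    .loop:
        movsd   xmm0, [rdi]       ; x
        movapd  xmm1, xmm2        ; acc = c4
        mulsd   xmm1, xmm0
        addsd   xmm1, xmm3        ; acc = acc*x + c3
        mulsd   xmm1, xmm0
        addsd   xmm1, xmm4        ; acc = acc*x + c2
        mulsd   xmm1, xmm0
        addsd   xmm1, xmm5        ; acc = acc*x + c1
        mulsd   xmm1, xmm0
        addsd   xmm1, xmm6        ; acc = acc*x + c0
        movsd   [rsi], xmm1
        add     rdi, 8
        add     rsi, 8
        dec     rdx
        jnz     .loop
        ret

Unrolling to hide the mulsd/addsd latency chain needs one more accumulator (xmm1's role) per unrolled iteration, which is exactly where 16 flexible registers would start to pinch.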

summary: 32 combined/flexible regs would usually be better for software than 16 int + 16 fp, but that costs extra instruction bits. 16 flexible regs would be significantly worse than 16 int + 16 FP, running into worse register pressure in some FP code.


Interrupt handlers usually have to save all the integer regs, but kernel code is normally built with integer instructions only. So interrupt latency would be worse if interrupt handlers had to save/restore the full width of 32 combined regs, instead of just 16 integer regs. They might still be able to skip save/restore of FPU control/status regs.

(An interrupt handler only needs to save the registers it actually modifies, or, if calling C, the call-clobbered regs. But an OS like Linux tends to save all the integer regs when entering the kernel, so it has the saved state of a thread in one place for handling ptrace system calls that modify the state of another process/thread. At least it does this at system-call entry points; IDK about interrupt handlers.)

If we're talking about 32int + 32fp vs. 32 flexible regs, and the combined regs are only for scalar double or float, then this argument doesn't really apply.


Speaking of calling conventions, when you use any FP registers, you tend to use a lot of them, typically in a loop with no non-inline function calls. It makes sense to have lots of call-clobbered FP registers.

But for integers, you tend to want an even mix of call-clobbered vs. call-preserved so you have some scratch regs to work with in small functions without saving/restoring something, but also lots of regs to keep stuff in when you are making frequent function calls.

Having a single set of registers would simplify calling conventions. "Why not store function parameters in XMM vector registers?" discusses calling-convention tradeoffs in more detail (too many call-clobbered vs. too many call-preserved). The stuff about integers in XMM registers wouldn't apply if there were only a single flat register space, though.
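
A minimal sketch of the clobbered-vs-preserved tradeoff, using the SysV x86-64 split (the labels leaf and caller are hypothetical):

    section .text
    leaf:                         ; scratch-only: no save/restore at all
        lea     rax, [rdi + rsi*2]
        ret

    caller:                       ; keeps a value live across a call
        push    rbx               ; rbx is call-preserved, so save it once...
        mov     rbx, rdi          ; ...and it survives the call below
        call    leaf
        add     rax, rbx
        pop     rbx
        ret

Small leaf functions want plenty of call-clobbered scratch registers; call-heavy code wants call-preserved ones. That's why integer conventions split the set, while FP registers are usually mostly call-clobbered.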


CPU physical design considerations

This is another set of major reasons.

First of all, I'm assuming a high-performance out-of-order design with large physical register files that the architectural registers are renamed onto. (See also my answer on Why does mulss take only 3 cycles on Haswell, different from Agner's instruction tables? (Unrolling FP loops with multiple accumulators)).

As @PaulClayton's answer points out, splitting the physical register file into integer and FP reduces the demand for read/write ports in each one. You can provide 3-source FMA instructions without necessarily providing any 3-input integer instructions.

(Intel Haswell is an example of this: adc and cmovcc are still 2 uops, but FMA is 1. Broadwell made adc and cmov single-uop instructions, too. It's not clear if register reads are the bottleneck in this loop that runs 7 unfused-domain uops per clock on Skylake, but only 6.25 on Haswell. It gets slower when changing some instructions from a write-only destination to read+write, and adding indexed addressing modes (blsi ebx, [rdi] to add ebx, [rdi+r8]). The latter version runs ~5.7 register reads per clock on Haswell, or ~7.08 on Skylake, the same as the fast version, indicating that Skylake might be bottlenecked at ~7 register reads per clock. Modern x86 microarchitectures are extremely complicated and have a lot going on, so we can't conclude much from that, especially since max FP uop throughput is nearly as high as max integer uop throughput.)

However, Haswell/Skylake have no trouble running 4x add reg, reg, which reads 8 registers per clock and writes 4. The previous example was constructed to mostly read "cold" registers that weren't also written, but a repeated 4x add will only read 4 cold registers (or 1 cold reg 4 times) as sources. Given limited registers, the destination was written only a few cycles ago at most, so it might be bypass-forwarded.

I don't know exactly where the bottleneck is in my example on Agner Fog's blog, but it seems unlikely that it's just integer register reads. Probably related to trying to max out unfused-domain uops, too.
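
For reference, the kind of loop that 8-reads-per-clock claim describes might look like this (a hypothetical microbenchmark, not Agner's actual test; rbp is assumed to hold the iteration count):

    bench:
    .loop:
        add     eax, r8d          ; each group of 4 independent adds reads
        add     ebx, r9d          ;   8 registers and writes 4
        add     ecx, r10d
        add     edx, r11d
        add     eax, r8d          ; unrolled second group, so the dec/jnz
        add     ebx, r9d          ;   overhead mostly amortizes and throughput
        add     ecx, r10d         ;   approaches 4 adds per clock
        add     edx, r11d
        dec     rbp
        jnz     .loop
        ret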


Physical distances on chip are another major factor: you want to place the FP register file physically near the FP execution units to reduce power and speed-of-light delays in fetching operands. The FP register file has larger entries (assuming SIMD), so reducing the number of ports it needs can save area or power on accesses to that many bits of data.

Keeping the FP execution units in one part of the CPU can make forwarding between FP operations faster than FP->integer (bypass delay). x86 CPUs keep SIMD/FP and integer pretty tightly coupled, with low cost for transferring data between scalar and FP domains. But some ARM CPUs basically stall the pipeline for FP->int transfers, so I guess normally they're more loosely coupled. As a general rule in HW design, two small fast things are normally cheaper / lower-power than one large fast thing.
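
The traffic in question is the handful of domain-crossing instructions, sketched here for x86 (register choices are arbitrary):

    cvtsi2sd  xmm0, rax       ; int -> FP domain crossing (convert)
    addsd     xmm0, xmm1      ; work staying inside the FP domain is cheap
    cvttsd2si rax, xmm0       ; FP -> int crossing (truncating convert)
    movq      rdx, xmm0       ; raw 64-bit bit-move, also a domain crossing

On x86 each crossing costs only a few cycles of latency; on a core where the FPU is loosely coupled, the same operations can cost a pipeline stall.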


Agner Fog's Proposal for an ideal extensible instruction set (now on Github and called ForwardCom) spawned some very interesting discussion about how to design an ISA, including this issue.

His original proposal was for a unified r0..r31 set of architectural registers, each 128-bit, supporting integer up to 64 bit (optionally 128-bit), and single/double (optionally quad) FP. Also usable as predicate registers (instead of having FLAGS). They could also be used as SIMD vectors, with optional hardware support for vectors larger than 128-bit, so software could be written / compiled to automatically take advantage of wider vectors in the future.

Commenters suggested splitting the vector registers off from the scalar ones, for the reasons above.

Specifically, Hubert Lamontagne commented:

Registers:

As far as I can tell, separate register files are GOOD. The reason for this is that as you add more read and write ports to a register file, its size grows quadratically (or worse). This makes CPU components larger, which increases propagation time, and increases fanout, and multiplies the complexity of the register renamer. If you give floating point operands their own register file, then aside from load/store, compare and conversion operations, the FPU never has to interact with the rest of the core. So for the same amount of IPC, say, 2 integer 2 float per cycle, separating float operations means you go from a monstrous 8-read 4-write register file and renaming mechanism where both integer ALUs and FP ALUs have to be wired everywhere, to a 2-issue integer unit and a 2-issue FPU. The FPU can have its own register renaming unit, its own scheduler, its own register file, its own writeback unit, its own calculation latencies, and FPU ALUs can be directly wired to the registers, and the whole FPU can live on a different section of the chip. The front end can simply recognize which ops are FPU and queue them there. The same applies to SIMD.

Further discussion suggested that separating scalar float from vector float would be silly, and that SIMD int and FP should stay together, but that dedicated scalar integer on its own does make sense because branching and indexing are special. (i.e. exactly like current x86, where everything except scalar integer is done in XMM/YMM/ZMM registers.)

I think this is what Agner eventually decided on.

If you were only considering scalar float and scalar int, there's more of a case to be made for unified architectural registers, but for hardware-design reasons it makes a lot of sense to keep them separate.

If you're interested in why ISAs are designed the way they are, and what could be better if we had a clean slate, I'd highly recommend reading through that whole discussion thread, if you have enough background to understand the points being made.

answered Oct 20 '22 by Peter Cordes