As I understand it, on a modern out of order CPU, one of the most expensive things is state, because that state has to be tracked in multiple versions, kept up-to-date across many instructions etc.
Some instruction sets like x86 and ARM make extensive use of flags, which were introduced when the cost model was not what what it is today, and the flags only cost a few logic gates. Things like every arithmetic instruction setting flags to detect zero, carry and overflow.
Are these particularly expensive to keep updated on a modern out of order implementation? Such that e.g. an ADD instruction updates the carry flag, and this must be tracked because although it will probably never be used, it is possible that some other instruction could use it N instructions later, with no fixed upper bound on N?
Are integer operations like addition and subtraction cheaper on instruction set architectures like MIPS that do not have these flags?
Various aspects of this are not very publicly known, so I will try to separate definitely known things from reasonable guesses and conjecture.
An approach has been to extend the (physical) integer registers (whether they take the form of a physical register file [eg P4 and SandyBridge+] or of results-in-ROB [eg P3]) with the flags that were produced by the operation that also produced the associated integer result. That's only about the arithmetic flags (sometimes AFLAGS, not to be confused with EFLAGS), but I don't think the "weird flags" are the focus of this question. Interestingly there is a patent[1] that hints at storing more than just the 6 AFLAGS themselves, putting some "combination flags" in there as well, but who know whether that was really done - most sources say the registers are extended by 6 bits, but AFAIK we (the public) don't really know. Lumping the integer result and associated flags together is described in for example this patent[2], which is primarily about preventing a certain situation where the flags might accidentally no longer be backed by any physical register. Aside from such quirks, during normal operation it has the nice effect of only needing to allocate 1 register for an arithmetic operation, rather than a separate main-result and flags-result, so renaming is normally not made much worse by the existence of the flags. Additionally, either the register alias table needs at least one more slot to keep track of which integer register contains the latest flags, or a separate flag-renaming-state buffer keeps track of the latest speculative flag state ([2] suggests Intel chose to separate them, which may simplify the main RAT but they don't go into such details). More slots may be used[3] to efficiently implement instructions which only update a subset of the flags (NetBurst™ famously lacked this, resulting in the now-stale advice to favour add
over inc
). Similarly, the non-speculative architectural state (whether it would be part of the retirement register file or be separate-but-similar again is not clear) needs at least one such slot.
A separate issue is computing the flags in the first place. [1] suggests separating flag generation from the main ALU simplifies the design. It's not clear to what degree they would be separated: the main ALU has to compute the Adjust and Sign flags anyway, and having an adder output a carry out the top is not much to ask (less than recomputing it from nothing). The overflow flag only takes an extra XOR gate to combine the carry into the top bit with the carry out of the top bit. The Zero flag and Parity flag are not for free though (and they depend on the result, not on the calculation of the result), if there is partial separation it would make sense that those would be computed separately. Perhaps it really is all separate. In NetBurst™, flag calculation took an extra half-cycle (the ALU was double-pumped and staggered)[4], but whether that means all flags are computed separately or a subset of them (or even a superset as [1] hinted) is not clear - the flags result is treated as monolithic so latency tests cannot distinguish whether a flag is computed in the third half-cycle by the flags unit or just handed to the flags unit by the ALU. In any case, typical ALU operations could be executed back-to-back, even if dependent (meaning that the high half of the first operation and the low half of the second operation ran in parallel), the delayed computation of the flags did not stand in the way of that. As you might expect though, ADC
and SBB
were not so efficient on NetBurst, but there may be other reasons for that too (for some reason a lot of µops are involved).
Overall I would conclude that the existence of arithmetic flags costs significant engineering resources to prevent them from having a significant performance impact, but that effort is also effective, so a significant impact is avoided.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With