"Late-forwarding" is mentioned in "Arm Neoverse E1 Core Software Optimization Guide" (as well as in their optimization guides for some other CPU models):
Instruction Group | Instructions | Exec Latency | Exec Throughput | Notes |
---|---|---|---|---|
Multiply accumulate (32-bit) | MADD, MSUB | 3 (2) | 1 | 2 |
Multiply accumulate (64-bit) | MADD, MSUB | 5 (4) | 1/3 | 2 |
(2) Multiply-accumulate pipelines support late-forwarding of accumulate operands from similar μOPs, allowing a typical sequence of multiply-accumulate μOPs to issue one every N cycles (accumulate latency N shown in parentheses).
What does the term "late-forwarding" mean? What sequence of instructions would be subject to late-forwarding (a counter-example would also be helpful)?
Late forwarding for multiply-add operations means that the addend can be made available after the multiplication has completed, rather than having to be available when the multiply-add operation begins execution. Since the multiplication itself is not data dependent on the addend, it can proceed without it. In a floating-point multiply-add, some work for the addition can be done in parallel with the multiplication (the exponent of the product is available early and can be compared with the addend's exponent to determine the amount of shift needed before the addition), so one may want the addend to be available before the entire product is available. Even in that case, the addend is not needed until much later than the multiplicands.
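To make the two cases from the question concrete, here is a minimal sketch in C of a chain that depends on the previous result through the accumulate operand, and a counter-example that depends on it through a multiplicand. The function names `dot_product` and `horner` are illustrative, and it is an assumption that a compiler targeting AArch64 turns each loop body into a single MADD; check the generated assembly before relying on it.

```c
#include <stddef.h>
#include <stdint.h>

/* Candidate for late-forwarding: the previous result is used only as the
 * addend (accumulate operand). The multiply a[i] * b[i] does not depend on
 * acc, so it can start before the previous iteration has finished. */
uint32_t dot_product(const uint32_t *a, const uint32_t *b, size_t n)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = acc + a[i] * b[i];   /* shape: acc = acc + (x * y) */
    return acc;
}

/* Counter-example (Horner's rule): the previous result is a multiplicand.
 * The multiply cannot start until acc is known, so each step waits for the
 * full latency of the previous one. Coefficients are highest degree first. */
uint32_t horner(const uint32_t *coeff, size_t n, uint32_t x)
{
    uint32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc = coeff[i] + acc * x;  /* shape: acc = c + (acc * x) */
    return acc;
}
```

Both functions use unsigned arithmetic so that wrap-around is well defined. Only the position of `acc` in the expression differs, and that position is what decides whether the shorter accumulate latency can apply.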
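By delaying the point at which the addend must be forwarded (made available), the effective latency of a chain of dependent accumulations is reduced: in the table above, a 32-bit MADD whose result feeds the accumulate operand of the next MADD sees a latency of 2 cycles instead of 3. Per the guide's note, this applies when the result is forwarded from a similar μOP into the accumulate operand; a result that is consumed as a multiplicand still sees the full latency. The shorter accumulate latency in turn reduces the number of independent accumulator registers (and hence the amount of parallelism in the code) one needs to cover the latency.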
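As a rough illustration of the point about accumulator registers, the sketch below applies the usual rule of thumb that the number of independent dependency chains needed to keep a pipeline busy is the latency divided by the issue interval, rounded up. The helper `chains_needed` is hypothetical and the model is a simplification (it ignores loads, loop overhead, and other pipeline limits); the cycle counts in the comments are the ones quoted from the table.

```c
/* Hypothetical helper: independent accumulator chains needed so that a new
 * multiply-accumulate can issue every issue_interval cycles, given that each
 * chain must wait latency cycles between dependent operations.
 * Computes ceil(latency / issue_interval) in integer arithmetic. */
unsigned chains_needed(unsigned latency, unsigned issue_interval)
{
    return (latency + issue_interval - 1) / issue_interval;
}

/* 32-bit MADD from the table: throughput 1 per cycle, so issue_interval = 1.
 *   full latency 3       -> chains_needed(3, 1) == 3
 *   accumulate latency 2 -> chains_needed(2, 1) == 2 */
```

Under this simplified model, late-forwarding lets a 32-bit accumulation loop saturate the multiplier with two accumulators rather than three.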