This question is about packed single-precision floating-point ops with XMM/YMM registers on Haswell.
According to the excellent instruction tables put together by Agner Fog, I know that on Haswell MUL can be issued on either port p0 or p1 (with a reciprocal throughput of 0.5), while ADD can only be issued on port p1 (with a reciprocal throughput of 1). I can accept this limitation, BUT I also know that FMA can be issued on either port p0 or p1 (with a reciprocal throughput of 0.5). So it is confusing to me why a plain ADD would be limited to p1 when FMA, which performs both an ADD and a MUL, can use either p0 or p1. Am I misunderstanding the table? Or can someone explain why this is?
That is, if my reading is correct, why wouldn't Intel just use the FMA unit as the basis for both plain MUL and plain ADD, thereby increasing the throughput of ADD as well as MUL? Alternatively, what would stop me from issuing two simultaneous, independent FMA ops to emulate two simultaneous, independent ADD ops? What are the penalties associated with doing ADD-by-FMA? Obviously it uses more registers (2 source registers for ADD vs. 3 for ADD-by-FMA), but other than that?
You're not the only one confused as to why Intel did this. Agner Fog in his micro-architecture manual writes for Haswell:
It is strange that there is only one port for floating point addition, but two ports for floating point multiplication.
On Agner's message board he also writes
There are two execution units for floating point multiplication and for fused multiply-and-add, but only one execution unit for floating point addition. This design appears to be suboptimal since floating point code typically contains more additions than multiplications.
That thread continues with more information on the subject which I suggest you read but I won't quote here.
He also discusses it in his answer to flops-per-cycle-for-sandy-bridge-and-haswell-sse2-avx-avx2:
The latency of FMA instructions on Haswell is 5 and the throughput is 2 per clock. This means that you must keep 10 parallel operations going to get the maximum throughput. If, for example, you want to add a very long list of f.p. numbers, you would have to split it in ten parts and use ten accumulator registers.
This is possible indeed, but who would make such a weird optimization for one specific processor?
His answer there basically answers your question: you can use FMA to double the throughput of addition. In fact I do this in my throughput tests for addition, and it does indeed double.
In summary: for addition, if your calculation is latency-bound, don't use FMA, use ADD. But if it's throughput-bound, you can try using FMA (by setting the multiplier to 1.0), though you will probably need many AVX registers to do this.
I unrolled ten times to get maximum throughput here: loop-unrolling-to-achieve-maximum-throughput-with-ivy-bridge-and-haswell