We know that Intel CPUs run integer division and FP div / sqrt on a not-fully-pipelined divide execution unit on port 0. This is known from IACA output, other published material (e.g. https://agner.org/optimize/), and experimental testing.
But are there independent dividers for FP and integer (competing only for dispatch via port 0), or does interleaving two div-throughput-bound workloads make their cost add nearly linearly, if one is integer and the other is FP?
This is complicated by Intel CPUs (unlike AMD) decoding integer division to multiple uops, e.g. 10 for div r32 on Skylake.
AMD CPUs similarly have their divider on one execution port, but I don't know as much about them and don't have one to test on. AMD integer division decodes to only a couple uops (to write RDX and RAX), not microcoded. Experiments on AMD might be easier to interpret without lots of uops flying around being a possible cause for contention between int and fp div.
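The two possibilities in the question predict different costs for an interleaved loop, and that can be written down as a rough model. This is only a sketch: the function name `predicted_cycles_per_iter` and the cycle counts are hypothetical placeholders rather than measurements, and the model ignores front-end effects and port 0 dispatch contention from the extra integer-division uops.

```python
def predicted_cycles_per_iter(int_div_cycles, fp_div_cycles):
    """Bounds on cycles per iteration for a loop doing one integer
    division and one FP division.

    Each argument is that division's inverse throughput (cycles per
    operation) when it runs alone in a throughput-bound loop.
    """
    return {
        # Independent dividers: the two streams overlap, so the slower
        # one sets the pace.
        "independent": max(int_div_cycles, fp_div_cycles),
        # One shared divider: the occupancy of the two streams adds up.
        "shared": int_div_cycles + fp_div_cycles,
    }


# Placeholder numbers, NOT measurements from any real CPU.
bounds = predicted_cycles_per_iter(int_div_cycles=6.0, fp_div_cycles=4.0)
print(bounds)
```

A measured interleaved loop that lands near the "shared" value would point to a common divider; one near the "independent" value would point to separate units that only share the port.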
Further reading:
arith.divider_active perf counter.
Intel CPU architect Ronak Singhal mentions on Twitter that Broadwell (and by implication subsequent architectures until Ice Lake) uses the FP hardware for integer division, but that Ice Lake has a dedicated integer division unit:
"Keep in mind that Broadwell that this was benchmarked on does integer division on the FP divider. In Ice Lake, there is now a dedicated integer divide unit."
So I would expect significant competition. Many of the operations that integer division performs are no doubt plain ALU ops that don't use the divider, so I wouldn't necessarily expect the two inverse throughputs to be strictly cumulative, but they will definitely compete.
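One way to put a number on "competing but not strictly cumulative" is to place a measured interleaved time between the two extremes. The helper below is a hypothetical sketch, not something from the tweet or from any tool: `contention_index` and its arguments are names made up here, and the inputs are assumed to be cycles per iteration from three separate throughput-bound runs (integer only, FP only, both interleaved).

```python
def contention_index(t_int, t_fp, t_both):
    """Place an interleaved measurement between the two extremes.

    Returns 0.0 when t_both equals max(t_int, t_fp) (no contention)
    and 1.0 when t_both equals t_int + t_fp (strictly cumulative).
    """
    floor = max(t_int, t_fp)
    # Distance between the two extremes: (t_int + t_fp) - max(...)
    span = min(t_int, t_fp)
    if span <= 0:
        raise ValueError("standalone times must be positive")
    return (t_both - floor) / span


# Placeholder numbers, NOT measurements from any real CPU.
print(contention_index(t_int=6.0, t_fp=4.0, t_both=8.0))
```

Values strictly between 0 and 1 would match the expectation above: part of the integer-division work runs on ordinary ALU ports, so the two workloads overlap to some degree while still contending for the divider.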
Ronak doesn't imply anything about pre-Broadwell implementations, but based on the similar port assignment and performance going back to at least Sandy Bridge, I think we can expect that the same sharing holds there.