
Do FP and integer division compete for the same throughput resources on x86 CPUs?

We know that Intel CPUs do integer division and FP div / sqrt on a not-fully-pipelined divide execution unit on port 0. This is known from IACA output, other published material (e.g. https://agner.org/optimize/), and experimental testing.

But are there independent dividers for FP and integer (competing only for dispatch via port 0), or does interleaving two div-throughput-bound workloads make their cost add nearly linearly, if one is integer and the other is FP?

This is complicated by Intel CPUs (unlike AMD) decoding integer division to multiple uops, e.g. 10 for div r32 on Skylake.
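The two possibilities in the question can be written down as a toy throughput model. This is only a sketch: the function name and the numbers are made up for illustration, and it assumes each loop is bound purely by divider throughput, which the extra uops of Intel integer division may violate.

```python
def predicted_interleaved_cycles(int_cycles, fp_cycles, shared_divider):
    """Predict cycles per iteration for a loop doing one integer div and one FP div.

    int_cycles / fp_cycles: cycles per iteration of each division loop run alone.
    Idealized assumption: each loop is limited only by divider throughput.
    """
    if shared_divider:
        # One divider serves both kinds of division, so the costs add.
        return int_cycles + fp_cycles
    # Independent dividers work in parallel, so the slower one sets the pace.
    return max(int_cycles, fp_cycles)


# Hypothetical inputs for illustration only, not measurements.
print(predicted_interleaved_cycles(6.0, 4.0, shared_divider=True))   # 10.0
print(predicted_interleaved_cycles(6.0, 4.0, shared_divider=False))  # 6.0
```

A measured interleaved time near the sum points toward a shared divider; a time near the max points toward independent units competing only for issue on port 0.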


AMD CPUs similarly have their divider on one execution port, but I don't know as much about them and don't have one to test on. AMD integer division decodes to only a couple of uops (to write RDX and RAX) and is not microcoded. Experiments on AMD might be easier to interpret without lots of uops flying around as a possible cause of contention between int and FP div.


Further reading:

  • Semi-related: Radix divider internals
  • Floating point division vs floating point multiplication - FP div/sqrt vs. multiply/FMA throughputs on various Intel and AMD CPUs.
  • Trial-division code runs 2x faster as 32-bit on Windows than 64-bit on Linux - Intel's 64-bit integer division is a lot slower: it decodes to more uops (36 vs. 10 on SKL) and doesn't even saturate the arith.divider_active perf counter.

Peter Cordes asked Oct 16 '19 21:10


1 Answer

Intel CPU architect Ronak Singhal mentions on Twitter that Broadwell (and, by implication, subsequent architectures until Ice Lake) uses the FP hardware for integer division, but that Ice Lake (ICL) has a dedicated integer division unit:

"Keep in mind that Broadwell that this was benchmarked on does integer division on the FP divider. In Ice Lake, there is now a dedicated integer divide unit."

So I would expect significant competition. Many of the uops that an integer division decodes to are no doubt plain ALU ops that don't use the divider, so I wouldn't necessarily expect the inverse throughputs of the two workloads to be strictly cumulative, but they will definitely compete.
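One way to put a number on "compete but not strictly cumulative" is an overlap fraction computed from three timings. This is a hedged sketch: the function name and the sample values are hypothetical, and it assumes all three timings come from the same machine under the same conditions.

```python
def overlap_fraction(int_alone, fp_alone, interleaved):
    """Return how much of the shorter workload was hidden when interleaving.

    0.0 means the costs were strictly additive (fully shared resource);
    1.0 means the shorter workload ran entirely in the shadow of the longer one.
    All arguments are times (or cycles) per iteration and must be positive.
    """
    if min(int_alone, fp_alone, interleaved) <= 0:
        raise ValueError("timings must be positive")
    saved = int_alone + fp_alone - interleaved
    return saved / min(int_alone, fp_alone)


# Hypothetical timings for illustration only, not measurements.
print(overlap_fraction(6.0, 4.0, 10.0))  # 0.0  (strictly cumulative)
print(overlap_fraction(6.0, 4.0, 9.0))   # 0.25 (mostly competing)
print(overlap_fraction(6.0, 4.0, 6.0))   # 1.0  (fully overlapped)
```

On a CPU that does integer division on the FP divider, the expectation above would correspond to a small but possibly nonzero fraction, since the non-divider ALU uops can still overlap with FP division.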

Ronak's comment doesn't say anything about pre-Broadwell implementations, but based on the similar port assignment and performance going back to at least Sandy Bridge, I think we can expect that the same sharing holds there too.

BeeOnRope answered Oct 19 '22 01:10