While I know (or so I have been told) that floating-point coprocessors are faster than any software implementation of floating-point arithmetic, I have no gut feeling for how large this difference is, in orders of magnitude.
The answer probably depends on the application and on the hardware, anywhere from microprocessors to supercomputers. I am particularly interested in computer simulations.
Can you point me to articles or papers on this question?
Linux is a popular operating system for embedded systems. Some embedded systems use processors without floating point accelerators (FPA) or floating point units (FPU), in order to satisfy cost and power consumption requirements.
An FPU provides a faster way to handle calculations with non-integer numbers. Any mathematical operation, such as addition, subtraction, multiplication, or division, can be performed either by the integer processing unit or by the FPU.
Floating-point emulation refers to the emulation of FPU hardware on architectures that have an FPU option but for which not all parts include the FPU. This allows a binary containing floating point instructions to run on a variant without an FPU.
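Whether a binary contains FPU instructions at all is usually decided when it is compiled. As a minimal sketch of the two modes, assuming a GCC toolchain targeting ARM (the flag spellings and the library routine name __aeabi_fmul follow common ARM/GCC conventions; check your own toolchain):

    /* float_demo.c - the same source can be built either way. */
    float scale(float x)
    {
        return x * 1.5f; /* hard float: a single FPU multiply instruction;
                            soft float: a call to a library routine such as
                            __aeabi_fmul, implemented with integer code */
    }

    /* Soft float - compiler emits calls into a software FP library:
     *     arm-none-eabi-gcc -mfloat-abi=soft -O2 -c float_demo.c
     * Hard float - compiler emits real FPU instructions:
     *     arm-none-eabi-gcc -mfloat-abi=hard -mfpu=vfpv4 -O2 -c float_demo.c
     */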
A floating-point unit (FPU, colloquially a math coprocessor) is a part of a computer system specially designed to carry out operations on floating-point numbers. Typical operations are addition, subtraction, multiplication, division, and square root.
A general answer will obviously be very vague, because performance depends on so many factors.
However, based on my understanding, on processors that do not implement floating-point (FP) operations in hardware, a software implementation will typically be 10 to 100 times slower (or even worse, if the implementation is bad) than integer operations, which are implemented in hardware on practically all CPUs.
The exact performance depends on a number of factors, such as the features of the integer hardware: some CPUs lack an FPU, but have features in their integer arithmetic that help implement a fast software emulation of FP calculations.
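To get a feeling for why emulation costs tens of cycles, here is a deliberately simplified sketch of a single-precision multiply built from integer operations only. It handles normal numbers only and truncates instead of rounding; real soft-float libraries must also deal with rounding modes, subnormals, infinities, and NaNs, which is where much of the extra cost comes from:

    #include <stdint.h>

    /* Multiply two IEEE-754 single-precision numbers given as raw bit
     * patterns, using only integer arithmetic. Normal numbers only;
     * no rounding, overflow, or special-value handling. */
    static uint32_t soft_fmul(uint32_t a, uint32_t b)
    {
        uint32_t sign = (a ^ b) & 0x80000000u;              /* product sign */
        int32_t  ea = (int32_t)((a >> 23) & 0xFF) - 127;    /* unbiased     */
        int32_t  eb = (int32_t)((b >> 23) & 0xFF) - 127;    /* exponents    */
        uint64_t ma = (a & 0x007FFFFFu) | 0x00800000u;      /* mantissas w/ */
        uint64_t mb = (b & 0x007FFFFFu) | 0x00800000u;      /* implicit 1   */

        uint64_t m = ma * mb;   /* 48-bit product of two 24-bit mantissas */
        int32_t  e = ea + eb;

        if (m & (1ull << 47)) { /* product in [2,4): renormalize to [1,2) */
            m >>= 24;
            e += 1;
        } else {
            m >>= 23;
        }
        return sign | (uint32_t)((e + 127) << 23) | ((uint32_t)m & 0x007FFFFFu);
    }

For example, soft_fmul(0x3FC00000, 0x40000000), the bit patterns of 1.5f and 2.0f, returns 0x40400000, the bit pattern of 3.0f. Even this stripped-down version needs over a dozen integer operations, consistent with emulated FP being at least an order of magnitude slower than integer arithmetic.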
The paper mentioned by njuffa (Cristina Iordache and Ping Tak Peter Tang, "An Overview of Floating-Point Support and Math Library on the Intel XScale Architecture") supports this. For the Intel XScale processor it lists the following latencies (excerpt):
integer addition or subtraction: 1 cycle
integer multiplication: 2-6 cycles
fp addition (emulated): 34 cycles
fp multiplication (emulated): 35 cycles
So this amounts to a factor of about 10-30 between integer and FP arithmetic. The paper also mentions that the GNU implementation (the one the GNU compiler uses by default) is about 10 times slower than that, for a total factor of 100-300.
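If you want a ballpark figure for your own target, a crude measurement along these lines is often enough to see the order of magnitude. This is only a sketch: clock() resolution, loop overhead, and compiler optimizations all distort the result, so treat the ratio as a rough estimate:

    #include <stdio.h>
    #include <time.h>

    /* Time a loop of integer multiply-adds against the same loop with
     * floats. Built with soft float on an FPU-less target, the float loop
     * exercises the emulation library. volatile keeps the compiler from
     * optimizing the loops away. */
    int main(void)
    {
        enum { N = 10 * 1000 * 1000 };
        volatile unsigned iv = 3u;
        volatile float    fv = 3.0f;

        clock_t t0 = clock();
        for (int i = 0; i < N; i++) iv = iv * 7u + 1u;        /* integer work */
        clock_t t1 = clock();
        for (int i = 0; i < N; i++) fv = fv * 0.9999f + 1.0f; /* FP work      */
        clock_t t2 = clock();

        double ti = (double)(t1 - t0) / CLOCKS_PER_SEC;
        double tf = (double)(t2 - t1) / CLOCKS_PER_SEC;
        printf("int:   %.3f s\nfloat: %.3f s\nratio: %.1fx\n", ti, tf, tf / ti);
        return 0;
    }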
Finally, note that the above is for the case where the FP emulation is compiled into the program by the compiler. Some operating systems (e.g. Linux and Windows CE) also have an FP emulation in the OS kernel. The advantage is that even code compiled without FP emulation (i.e. using FPU instructions) can run on a processor without an FPU: the kernel transparently emulates unsupported FPU instructions in software. However, this emulation is even slower (by roughly another factor of 10) than a software emulation compiled into the program, because of the additional trap overhead. Obviously, this case is only relevant on processor architectures where some processors have an FPU and some do not (such as x86 and ARM).
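On Linux you can often tell which situation you are in by looking at /proc/cpuinfo. A minimal sketch, assuming the ARM convention of a "Features" line that lists vfp when a hardware FPU is present (other architectures report this differently, e.g. an fpu flag on x86, so this is illustrative rather than portable):

    #include <stdio.h>
    #include <string.h>

    /* Scan /proc/cpuinfo for a "vfp" entry on the Features line (ARM). */
    int main(void)
    {
        char line[512];
        FILE *f = fopen("/proc/cpuinfo", "r");
        if (!f) { perror("fopen"); return 1; }
        while (fgets(line, sizeof line, f)) {
            if (strncmp(line, "Features", 8) == 0 && strstr(line, "vfp")) {
                puts("hardware FPU present (vfp)");
                fclose(f);
                return 0;
            }
        }
        fclose(f);
        puts("no vfp flag - floating point is likely emulated");
        return 0;
    }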
Note: This answer compares the performance of (emulated) FP operations with integer operations on the same processor. Your question might also be read as asking about the performance of emulated FP operations compared to hardware FP operations (I am not sure which you meant). However, the result would be about the same, because if FP is implemented in hardware, it is typically (almost) as fast as integer operations.