My teacher claims that the processor can sometimes do FPU operations in parallel. Like this:
float a = 3.14;
float b = 5.12;
float c;
float d = 3.02;
float e = 2.52;
float f;
c = a + b;
f = e + d;
So, as I've heard, the two add operations above would be executed more quickly than:
float a = 3.14;
float b = 5.12;
float c;
float d = 3.02;
float e = 2.52;
float f;
c = a + b;
f = c + d;
because the processor has to wait until c gets computed.
I wanted to verify this, so I wrote a function that does the second thing, and it measures the time by checking the Time Stamp Counter:
flds h # st(7)
flds g # st(6)
flds f # st(5)
flds e # st(4)
flds d # st(3)
flds c # st(2)
flds b # st(1)
flds a # st(0)
fadd %st, %st(1) # i = a + b
fmul %st, %st(2) # j = i * c
fadd %st, %st(3) # k = j + d
fmul %st, %st(4) # l = k * e
fadd %st, %st(5) # m = l + f
fmul %st, %st(6) # n = m * g
fadd %st, %st(7) # o = n + h
Those are not independent. Now, I am trying to write independent ones. But the problem is that no matter which instruction I actually use, the result is always saved to ST(0); optionally it can then be popped, but that still means we have to wait until the computation finishes.
I looked at the code generated by a compiler (gcc -S). It simply doesn't operate like this on st registers. For every number, it does:
flds number
fstps -some_value(%ebp)
And then (for example, for a and b, where -4(%ebp) is a and -8(%ebp) is b):
flds -4(%ebp)
fadds -8(%ebp) # i = a + b
fstps -32(%ebp)
So it first loads the value into the FPU and stores it back to the normal stack. Then, it loads one value (to st(0)), adds the other to it, and the result is stored back to memory and popped. So it's still not independent, because we have to wait until st(0) gets freed.
Did my teacher say something wrong, or is there a way to make them independent that would give a noticeably different execution time when I measure it?
In the style of PolitiFact, I would rate your teacher's statement that "the processor can sometimes do FPU operations in parallel" as "half-true". In some senses and under certain conditions, it is completely true; in other senses, it is not true at all. So to make the general statement is very misleading and very likely to be misinterpreted.
Now, most likely, your teacher said this in a very specific context, making some assumptions about what (s)he had already told you previously, and you didn't include all of that in the question, so I won't blame them for being intentionally misleading. Instead, I will try to clarify this general claim, pointing out some ways in which it is true and other ways in which it is false.
The big sticking point is exactly what is meant by "FPU operations". Classically, x86 processors have done FPU operations on a separate floating-point coprocessor (known as a floating-point unit, or FPU), the x87. Up until the 80486 processor, this was a separate chip installed on the main board. Starting with the 80486DX, the x87 FPU was integrated directly onto the same silicon as the main processor and was therefore available on all systems, instead of just those that had a specialized x87 FPU installed. This remains true today—all x86 processors have a built-in x87-compatible FPU, and this is generally what people refer to when they say "FPU" in the context of the x86 microarchitecture.
However, the x87 FPU is rarely used anymore for floating-point operations. Although it is still there, it has effectively been superseded by a SIMD unit that is both easier to program for and (in general) more efficient.
AMD was the first to introduce such a specialized vector unit with their 3DNow! technology in the K6-2 microprocessor (circa 1998). For various technical and marketing reasons, this didn't really get used, except in certain games and other specialized applications, and never caught on in the industry (AMD has since phased it out on modern processors), but it did support arithmetic operations on packed, single-precision floating-point values.
SIMD really began to catch on when Intel released the SSE extension with the Pentium III processor. SSE was similar to 3DNow!, in that it supported vector operations on single-precision floating-point values, but was incompatible with it and supported a slightly larger range of operations. AMD quickly added SSE support to their processors, too. The really nice thing about SSE compared to 3DNow! was that it used a completely separate set of registers, which made programming much easier. With the Pentium 4, Intel released SSE2, which was an extension of SSE that added support for double-precision floating-point values. SSE2 is supported by all processors that support the 64-bit long mode extensions (AMD64), which is all processors made today, so 64-bit code virtually always uses SSE2 instructions to manipulate floating-point values, rather than x87 instructions. Even in 32-bit code, SSE2 instructions are in common use today, since all processors since the Pentium 4 have supported them.
Aside from support for legacy processors, there is really only one reason to use x87 instructions today, and that is that the x87 FPU supports a special "long double" format, with 80 bits of precision. SSE supports only single-precision (32-bit) values, while SSE2 added support for double-precision (64-bit) values. If you absolutely need extended precision, then the x87 is your best option. (At the level of individual instructions, it is comparable in speed to the SIMD units operating on scalar values.) Otherwise, you should prefer SSE/SSE2 (and later SIMD extensions to the instruction set, like AVX, etc.). And, of course, when I say "you", I don't just mean assembly-language programmers; I also mean compilers. For example, Visual Studio 2010 was the last major version to emit x87 code by default for 32-bit builds. In all later versions, SSE2 instructions are generated unless you specifically turn them off (/arch:IA32).
With these SIMD instructions, it is entirely true that multiple floating-point operations can be done simultaneously; in fact, that is the whole point. And even when you're working with scalar (non-packed) floating-point values, as in the code you've shown, modern processors generally have multiple execution units that allow multiple operations to be done simultaneously, assuming certain conditions are met: there must be a lack of data dependencies (as you point out), and it also depends on which specific instructions are being executed (some instructions can only be executed on certain units, limiting the amount of true parallelism).
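To make this concrete, here is a hedged sketch of the scalar SSE2 code a compiler might emit for your first example (the register choices and addressing are illustrative, not any particular compiler's actual output):
movss xmm0, [a]   ; load a
addss xmm0, [b]   ; c = a + b, accumulating in xmm0
movss xmm1, [e]   ; load e into a different register
addss xmm1, [d]   ; f = e + d; no dependency on xmm0, so both additions
                  ; can be in flight at the same time
movss [c], xmm0   ; store c
movss [f], xmm1   ; store f
Because each computation lives in its own register, the second addition doesn't have to wait for the first, unlike with the x87's stack-based interface described below.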
But as I said before, the reason I term this claim misleading is because when someone says "FPU", it is generally understood to mean the x87 FPU, and in that case, the options for independent, parallel execution are substantially more limited. x87 FPU instructions are all the ones whose mnemonics begin with f, including FADD, FMUL, FDIV, FLD, FSTP, etc. These instructions cannot pair* and therefore can never be executed truly independently.
There is only one special exception to the rule that x87 FPU instructions cannot pair, and that is the FXCH instruction (floating-point exchange). FXCH can pair when it occurs as the second instruction in a pair, as long as the first instruction in the pair is either FLD, FADD, FSUB, FMUL, FDIV, FCOM, FCHS, or FABS, and the next instruction following FXCH is also a floating-point instruction. So, this does cover the most common cases where you would use FXCH. As Iwillnotexist Idonotexist alluded to in a comment, this magic is implemented internally via register renaming: the FXCH instruction does not actually swap the contents of the two registers, as you might imagine; it only swaps the names of the registers. On the Pentium and later processors, registers can be renamed while they are in use, and can even be renamed more than once per clock, without incurring any stalls. This feature is actually very important to maintaining top performance in x87 code. Why? Well, the x87 is unusual in that it has a stack-based interface. Its "registers" (st0 through st7) are implemented as a stack, and several floating-point instructions operate only on the value at the top of the stack (st0). But a feature that allows you to use the stack-based interface of the FPU in a reasonably efficient manner hardly counts as "independent" execution.
However, it is true that many x87 FPU operations can overlap. This works just like any other type of instruction: since the Pentium, x86 processors have been pipelined, which effectively means that instructions execute in many different stages. (The longer the pipeline, the more stages of execution, which means the more instructions the processor can be working on at a time, which also generally means the faster the processor can be clocked. However, it has other disadvantages, like higher penalties for mispredicted branches, but I digress.) So, although each instruction still takes a fixed number of cycles to complete, it is possible for an instruction to begin executing before the previous one has finished. For example:
fadd st(1), st(0) ; clock cycles 1 through 3
fadd st(2), st(0) ; clock cycles 2 through 4
fadd st(3), st(0) ; clock cycles 3 through 5
fadd st(4), st(0) ; clock cycles 4 through 6
The FADD instruction takes 3 clock cycles to execute, but we can start a new FADD on each clock cycle. As you can see, it is possible to do up to 4 FADD operations in only 6 clock cycles, which is twice as fast as the 12 clock cycles that this would take on a non-pipelined FPU.
Naturally, as you say in the question, this overlapping requires that there are no dependencies between the two instructions. In other words, two instructions cannot be overlapped if the second one requires the result of the first. In practice, this unfortunately means that the gains from this pipelining are limited. Because of the FPU's stack-based architecture that I mentioned earlier, and the fact that most floating-point instructions involve the value at the top of the stack (st(0)), there are extremely few cases where it is possible for an instruction to be independent of the previous instruction's result.
The way around this conundrum is the pairing of the FXCH instruction that I mentioned earlier, which makes it possible to interleave multiple, independent calculations if you are extremely careful and clever in your scheduling. Agner Fog, in an old version of his classic optimization manuals, gives the following example:
fld [a1] ; cycle 1
fadd [a2] ; cycles 2-4
fld [b1] ; cycle 3
fadd [b2] ; cycles 4-6
fld [c1] ; cycle 5
fadd [c2] ; cycles 6-8
fxch st(2) ; cycle 6 (pairs with previous instruction)
fadd [a3] ; cycles 7-9
fxch st(1) ; cycle 7 (pairs with previous instruction)
fadd [b3] ; cycles 8-10
fxch st(2) ; cycle 8 (pairs with previous instruction)
fadd [c3] ; cycles 9-11
fxch st(1) ; cycle 9 (pairs with previous instruction)
fadd [a4] ; cycles 10-12
fxch st(2) ; cycle 10 (pairs with previous instruction)
fadd [b4] ; cycles 11-13
fxch st(1) ; cycle 11 (pairs with previous instruction)
fadd [c4] ; cycles 12-14
fxch st(2) ; cycle 12 (pairs with previous instruction)
In this code, three independent computations have been interleaved: (a1 + a2 + a3 + a4), (b1 + b2 + b3 + b4), and (c1 + c2 + c3 + c4). Since each FADD takes 3 clock cycles, after we kick off the a computation, we have two "free" cycles to start two new FADD instructions for the b and c computations before returning to the a computation. Every third FADD instruction returns to the original computation, following a regular pattern. In between, FXCH instructions are used to make the top of the stack (st(0)) contain the value that belongs to the appropriate computation. Equivalent code could be written for FSUB, FMUL, and FILD, since all three take 3 clock cycles and are able to overlap. (Well, except that the FMUL instruction is not perfectly pipelined, at least on the Pentium; I'm not sure if this holds true on later processors, since I don't use the x87 anymore. This means you cannot start an FMUL one clock cycle after another FMUL. You either have a stall, or you have to throw another instruction in between.)
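For example, here is a hedged sketch of filling that gap (the labels are hypothetical, and Pentium-era timing is assumed):
fld [a1]    ; start the first chain
fmul [a2]   ; first multiply begins
fld [b1]    ; independent load fills the gap cycle after the FMUL
fmul [b2]   ; second, independent multiply can now start without a stall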
I imagine that this kind of thing is what your teacher had in mind. In practice, though, even with the magic of the FXCH instruction, it is quite difficult to write code that truly achieves significant levels of parallelism. You need to have multiple independent computations that you can interleave, but in many cases, you are just calculating a single, big formula. There are sometimes ways to compute pieces of the formula independently, in parallel, and then combine them at the end, but you will inevitably have stalls in there that reduce the overall performance, and not all floating-point instructions can overlap. As you might imagine, this is so difficult to achieve that compilers rarely do it (to any significant extent). It requires a human with the determination and fortitude to hand-optimize the code, manually scheduling and interleaving the instructions.
One thing that is more often possible is interleaving floating-point and integer instructions. Instructions like FDIV are slow (~39 cycles on the Pentium) and do not overlap well with other floating-point instructions; however, FDIV can overlap with integer instructions on all but its first clock cycle. (There are always caveats, and this is no exception: floating-point division cannot be overlapped with integer division because they are handled by the same execution unit on almost all processors.) Something similar could be done with FSQRT. Compilers are somewhat more likely to perform these types of optimizations, assuming that you have written the code so that integer operations are interspersed around floating-point operations (inlining helps dramatically with this), but still, in many cases where you're doing extended floating-point computations, you have little integer work that needs to be done.
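As a sketch of the idea (the labels here are hypothetical, and the timing is Pentium-era):
fld [x]      ; dividend at top of stack
fdiv [y]     ; long-latency division begins (~39 cycles on the Pentium)
mov eax, [i] ; integer instructions execute while the division
add eax, [j] ; is still in flight...
mov [k], eax
fstp [q]     ; ...only this store has to wait for the quotient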
Now that you have a better understanding of the complexities of achieving truly "independent" floating-point operations, and why the FADD+FMUL code you wrote doesn't actually overlap or perform any faster, let me briefly address the problems you ran into when trying to look at the output from a compiler.
(By the way, this is a great strategy and one of the primary ways that I learned how to write and optimize assembly code. And building on a compiler's output is still how I start when I want to hand-optimize a particular snippet of code.)
As I mentioned above, modern compilers don't generate x87 FPU instructions. They never do for 64-bit builds, so you must start by compiling in 32-bit mode. Then, you generally have to specify a compiler switch that instructs it not to use SSE instructions. In MSVC, this is /arch:IA32. In Gnu-style compilers, like GCC and Clang, this is -mfpmath=387 and/or -mno-sse.
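For example, assuming a hypothetical source file named example.c, a GCC invocation to get an optimized, 32-bit, x87-only listing might look like this:
gcc -m32 -mfpmath=387 -mno-sse -O2 -S example.c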
There is one other little niggle that explains what you were actually seeing. The C code you wrote used the float type, which is a single-precision (32-bit) type. As you learned above, the x87 FPU uses a special 80-bit "extended" precision internally. That mismatch in precision can affect the output of floating-point operations, so to strictly comply with the IEEE-754 and language-specific standards, compilers default to a "strict" or "precise" mode when using the x87 FPU, in which they flush each intermediate result back to 32-bit precision. This is why you see the pattern you see:
flds -4(%ebp) # load a, implicitly extending it to 80-bit precision
fadds -8(%ebp) # i = a + b (loads b, extends it, adds to st(0))
fstps -32(%ebp) # store i to memory as a 32-bit single, and pop
It loads a single-precision value at the top of the FPU stack, implicitly extending that value to have 80-bit precision. This is the FLDS instruction. Then, the FADDS instruction does a combination load-and-add: it first loads a single-precision value, implicitly extending it to have 80-bit precision, and adds that to the value at the top of the FPU stack. Finally, it pops the result to a temporary location in memory, flushing it to a 32-bit, single-precision value.
You're entirely right that you won't get any parallelism with code like this. Even basic overlapping becomes impossible. But code like this is generated for precision, not for speed. All sorts of other optimizations are disabled, too, in the name of correctness.
If you want to prevent this and get the fastest floating-point code possible, even at the expense of correctness, then you need to pass a flag to indicate this to the compiler. On MSVC, this is /fp:fast. On Gnu-style compilers, like GCC and Clang, this is -ffast-math.
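Adding that flag to the hypothetical GCC invocation from above:
gcc -m32 -mfpmath=387 -mno-sse -O2 -ffast-math -S example.c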
A couple of other related tips:
When you're analyzing compiler-generated disassembly, always make sure that you're looking at optimized code. Don't bother with unoptimized code; it's very noisy, will just confuse you, and doesn't match what a real assembly programmer would actually write. For MSVC, then, use the /O2 switch; for GCC/Clang, use the -O2 or -O3 switches.
Unless you just really like the AT&T syntax, configure your Gnu compiler or disassembler to emit Intel-format syntax listings. This will ensure that the output looks like the code you would see in Intel's manuals or other books on assembly-language programming. For the compiler, use the options -S -masm=intel. For objdump, use the options -d -M intel. This isn't necessary with Microsoft's compiler, as it never uses AT&T syntax.
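Concretely, again with hypothetical file names:
gcc -O2 -S -masm=intel example.c      # Intel-syntax assembly listing
objdump -d -M intel a.out             # Intel-syntax disassembly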
* Starting with the Pentium processor (circa 1993), integer instructions executed on the main part of the processor could be "paired". This was accomplished by the processor actually having two mostly-independent execution units, known as the "U" pipe and the "V" pipe. There were naturally some caveats to this pairing—the "V" pipe was more limited in the instructions that it could execute than the "U" pipe, and so certain instructions and certain combinations of instructions were non-pairable—but in general, this possibility of pairing doubled the effective bandwidth of the Pentium, making it significantly faster than its predecessor (the 486) on code that had been written accordingly. What I'm saying here is that, in contrast to the main integer side of the processor, the x87 FPU did not support this type of pairing.
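As a rough sketch of what such pairing looked like (the registers and memory operands are illustrative):
mov eax, [p]  ; issues in the U pipe
mov ebx, [q]  ; independent, so it pairs and issues in the V pipe the same cycle
add eax, 4    ; U pipe
add ebx, 8    ; independent of the U-pipe instruction, so it pairs in the V pipe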