I am trying to compute the function e^x (x is a single-precision float) in x87. To achieve this, I am using the Taylor series (as a reminder: e^x = 1 + x^1/1! + x^2/2! + ... + x^n/n!).
As I'm working with the x87, I can calculate all values in extended precision (80-bit instead of the single-precision 32-bit).
My realization so far is: I have the summand and the sum in two separate registers in the FPU (plus some other less significant things). I have a loop that keeps updating the summand and adding it to the total sum. So in vague pseudo-code:
loop:
;update my summand
n = n + 1
summand = summand / n
summand = summand * x
;add the summand to the total sum
sum = sum + summand
My problem now is the exit condition of the loop. I wanted to design it so that the loop exits once adding the summand no longer changes the value of the sum in single precision. However, I am finding out the hard way that implementing such an exit condition is very complex and costs a lot of instructions, and therefore computing time.
My only kind of okayish idea so far is: 1. Get the exponents of the sum and the summand via FXTRACT. If (exp(sum) - exp(summand)) > 23, the summand can no longer affect the bit representation of the sum in single precision (because the single-precision mantissa has 23 bits), so exit. 2. Compare the summand with 0; if it is zero, it obviously cannot change the result anymore either.
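As a sanity check of that idea, here is a hedged C sketch (all names are made up, not from the original assembly; frexp stands in for FXTRACT, and double stands in for the FPU's 80-bit extended precision):

```c
#include <math.h>

/* Once the summand's binary exponent is more than 23 below the sum's,
   the summand falls entirely outside the 23-bit single-precision
   mantissa and can no longer change the rounded result. */
static float exp_taylor(float x)
{
    double sum = 1.0, summand = 1.0; /* double stands in for 80-bit precision */
    int n = 0;
    for (;;) {
        n += 1;
        summand = summand / n * x;   /* update the summand: x^n / n! */
        sum += summand;
        int esum, esummand;
        frexp(sum, &esum);           /* extract binary exponents, like FXTRACT */
        frexp(summand, &esummand);
        if (summand == 0.0 || esum - esummand > 23)
            break;                   /* summand no longer matters in float */
    }
    return (float)sum;
}
```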
My question is, whether somebody would have any more efficient ideas than me for the exiting condition?
Perhaps the most efficient way to compute e^x using the x87 floating-point unit is the following sequence of instructions:
; Computes e^x via the formula 2^(x * log2(e))
fldl2e ; st(0) = log2(e) <-- load log2(e)
fmul [x] ; st(0) = x * log2(e)
fld1 ; st(0) = 1 <-- load 1
; st(1) = x * log2(e)
fld st(1) ; st(0) = x * log2(e) <-- make copy of intermediate result
; st(1) = 1
; st(2) = x * log2(e)
fprem ; st(0) = partial remainder((x * log2(e)) / 1) <-- call this "rem"
; st(1) = 1
; st(2) = x * log2(e)
f2xm1 ; st(0) = 2^(rem) - 1
; st(1) = 1
; st(2) = x * log2(e)
faddp st(1), st(0) ; st(0) = 2^(rem) - 1 + 1 = 2^(rem)
; st(1) = x * log2(e)
fscale ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
; st(1) = x * log2(e)
fstp st(1)
The result is left in st(0).
If you already have the input, x, on the floating-point stack, then you can modify the sequence of instructions slightly:
; st(0) == x
fldl2e
fmulp st(1), st(0)
fld1
fld st(1)
fprem
f2xm1
faddp st(1), st(0)
fscale
fstp st(1)
This multiplies x by log2(e), takes the remainder of that product after division by 1, raises 2 to the power of that remainder (f2xm1 also subtracts 1, so we add 1 back in), and finally uses fscale to multiply by 2 raised to the truncated value of x × log2(e).
An alternative implementation, essentially the one suggested by fuz and reminiscent of the code generated by MSVC for the exp function from the C standard library, is the following:
; st(0) == x
fldl2e
fmulp st(1), st(0)
fld st(0)
frndint
fsub st(1), st(0)
fxch ; pairable with FSUB on Pentium (P5)
f2xm1
fld1
faddp st(1), st(0)
fscale
fstp st(1)
The primary difference is the use of frndint and fsub to get a value in the range of −1.0 to +1.0, as required by f2xm1, instead of using fprem to get the remainder after a division by 1.
To get an idea of the relative costs of these instructions, we'll pull data from Agner Fog's instruction tables. A "?" indicates that the corresponding data is unavailable.
Instruction AMD K7 AMD K8 AMD K10 Bulldozer Piledriver Ryzen
-------------------------------------------------------------------------------------------
FLD/FLD1/FLDL2E [all very fast, 1-cycle instructions, with a reciprocal throughput of 1]
FADD(P)/FSUB(P) 1/4/1 1/4/1 1/4/1 1/5-6/1 1/5-6/1 1/5/1
FMUL(P) 1/4/1 1/4/1 1/4/1 1/5-6/1 1/5-6/1 1/5/1
FPREM 1/7-10/8 1/7-10/8 1/?/7 1/19-62/? 1/17-60/? 2/?/12-50
FRNDINT 5/10/3 5/10/3 6/?/37 1/4/1 1/4/1 1/4/3
FSCALE 5/8/? 5/8/? 5/9/29 8/52/? 8/44/5 8/9/4
F2XM1 8/27/? 53/126/? 8/65/30? 10/64-71/? 10/60-73/? 10/50/?
The notation used above is "ops/latency/reciprocal throughput".
Instruction 8087 287 387 486 P5 P6 PM Nehalem
-------------------------------------------------------------------------------------------
FLD 17-22 17-22 14 4 1/0 1/? 1/1 1/1
FLD1 15-21 15-21 24 4 2/0 2/? 2/? 2/?
FLDL2E 15-21 15-21 40 8 5/2 2/? 2/? 2/?
FADD(P)/FSUB(P) 70-100 70-100 23-34 8-20 3/2 1/3 1/3 1/3
FMUL(P) 90-145 90-145 29-57 16 3/2 1/5 1/5 1/5
FPREM 15-190 15-190 74-155 70-138 16-64 (2) 23/? 26/37 25/14
FRNDINT 16-50 16-50 66-80 21-30 9-20 (0) 30/? 15/19 17/22
FSCALE 32-38 32-38 67-86 30-32 20-32 (5) 56/? 28/43 24/12
F2XM1 310-630 310-630 211-476 140-279 13-57 (2) 17-48/66 ~20/? 19/58
For pre-P5, the value is just cycle counts. For P5, the notation is cycle counts, with the parenthetical indicating overlap with other FP instructions. For P6 and later, the notation is µops/latency.
Clearly, f2xm1 is a slow, expensive instruction, but it is used by both implementations and is hard to avoid. As slow as it is, it's still way faster than implementing this as a loop.
(One possible way around the slow f2xm1 instruction, if you are willing to sacrifice code size for speed, would be a look-up-table-based approach. You can find several papers published on this if you search, although most of them are behind paywalls. :-( Here's one publicly-accessible reference. In fact, this is probably what an optimized math library written in C would do for its implementation of exp. Ben Voigt wrote similar code in C#. The basic question here is, as always: are you optimizing for size or speed? In other words, are you calling this function frequently and need it to be as fast as possible? Or are you calling it infrequently and just need it to be precise, without consuming too much space in the binary and potentially slowing down other code on the hot path by evicting it from the cache?)
But what about fprem vs. frndint?
Well, on AMD CPUs, fprem pretty consistently decodes to fewer operations than frndint, although frndint is getting faster than fprem on modern AMD CPUs. Then again, I'd say that's irrelevant, because on those CPUs, you'd be writing SSE2/AVX code, not using the legacy x87 FPU. (These instructions also get slower on later Intel microarchitectures, like Sandy Bridge and Haswell, but the same argument applies there, so I've ignored those altogether in compiling the data.) What really matters is how this code performs on older CPUs, and on older CPUs, the version that uses fprem looks like a clear winner.
On Intel CPUs, however, the reverse appears to be true: frndint is generally faster than fprem, except on P6 (Pentium Pro, Pentium II, and Pentium III).
But we're not just talking about frndint vs. fprem; we're actually talking about frndint+fsub vs. fprem. fsub is a relatively fast instruction, but it starts to get difficult to predict just what performance this code will have without executing and timing it. Cycle counts can only tell us so much, and the fprem sequence is shorter overall (fewer instructions and, more importantly, fewer bytes: 18 vs. 22), which can mean substantial speed improvements.
Either implementation is good if you don't want to bother benchmarking, or if you want something generic that runs well on all CPUs. Neither is going to set the world on fire in terms of performance, but as I said before, both will be significantly faster than a loop that computes a Taylor series. The loop overhead is what will kill your performance.