x87 FPU: computing e^x, maybe with a Taylor series?

I am trying to compute the function e^x (x is a single-precision float) on the x87 FPU. To achieve this I am using the Taylor series (as a reminder: e^x = 1 + (x^1)/1! + (x^2)/2! + ... + (x^n)/n! + ...).

As I'm working with the x87, I can calculate all intermediate values in extended precision (80 bits instead of the 32 bits of single precision).

What I have so far: the summand and the sum live in two separate registers in the FPU (plus some other, less important things). I have a loop that keeps updating the summand and adding it to the total sum. So, in vague pseudo-code:

loop:
;update my summand
n = n + 1
summand = summand / n
summand = summand * x
;add the summand to the total sum
sum = sum + summand

My problem now is the loop's exit condition. I wanted to design it so that the loop exits once adding the summand would no longer change the value of the sum in single precision, but I am finding out the hard way that implementing such an exit condition is quite complex and burns a lot of instructions, and therefore computing time.

My only halfway-decent idea so far would be:

1. Get the exponents of the sum and of the summand via FXTRACT. If (exp(sum) - exp(summand)) > 23, the summand will no longer affect the bit representation of the sum in single precision (because the single-precision mantissa has 23 bits), so exit. A rough sketch of this test is shown below.
2. Compare the summand with 0; if it is zero, it obviously cannot affect the result anymore either.
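
Roughly, I imagine the test looking something like this (only a sketch; it assumes st(0) = summand and st(1) = sum, with x and n kept in memory so the stack has room for the temporaries; limit is a 16-bit word in memory holding the constant 23, and done is the label after the loop; both names are placeholders):

; assumption: st(0) = summand, st(1) = sum; limit and done are placeholder names
ftst                    ; compare summand with 0.0
fstsw   ax
sahf
jz      done            ; summand == 0 --> it cannot change the sum anymore
fld     st(1)           ; st(0) = sum, st(1) = summand, st(2) = sum
fxtract                 ; st(0) = significand(sum), st(1) = exp(sum), ...
fstp    st(0)           ; discard the significand; st(0) = exp(sum)
fld     st(1)           ; st(0) = summand, st(1) = exp(sum), ...
fxtract                 ; st(0) = significand(summand), st(1) = exp(summand), ...
fstp    st(0)           ; st(0) = exp(summand), st(1) = exp(sum)
fsubp   st(1), st(0)    ; st(0) = exp(sum) - exp(summand)
ficomp  word [limit]    ; compare the difference with 23, then pop
fstsw   ax
sahf
ja      done            ; difference > 23 --> summand no longer affects the sum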

My question is: does anybody have a more efficient idea for the exit condition than this?

asked Mar 09 '23 by Maru

1 Answer

Perhaps the most efficient way to compute e^x using the x87 floating-point unit is the following sequence of instructions:

; Computes e^x via the formula 2^(x * log2(e))
fldl2e                  ; st(0) = log2(e)        <-- load log2(e)
fmul    [x]             ; st(0) = x * log2(e)
fld1                    ; st(0) = 1              <-- load 1
                        ; st(1) = x * log2(e)
fld     st(1)           ; st(0) = x * log2(e)    <-- make copy of intermediate result
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
fprem                   ; st(0) = partial remainder((x * log2(e)) / 1)  <-- call this "rem"
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
f2xm1                   ; st(0) = 2^(rem) - 1
                        ; st(1) = 1
                        ; st(2) = x * log2(e)
faddp   st(1), st(0)    ; st(0) = 2^(rem) - 1 + 1 = 2^(rem)
                        ; st(1) = x * log2(e)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e)))
                        ; st(1) = x * log2(e)
fstp    st(1)           ; st(0) = 2^(x * log2(e)) = e^x   <-- pop the leftover x * log2(e)

The result is left in st(0).
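
If the caller ultimately wants a single-precision result again (as in the question), the extended-precision value can be rounded and stored in one step. For example (assuming result is a 32-bit memory location defined elsewhere):

fstp    dword [result]  ; round e^x to single precision, store it, and pop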

If you already have the input, x, on the floating-point stack, then you can modify the sequence of instructions slightly:

; st(0) == x
fldl2e                  ; st(0) = log2(e), st(1) = x
fmulp   st(1), st(0)    ; st(0) = x * log2(e)
fld1                    ; st(0) = 1, st(1) = x * log2(e)
fld     st(1)           ; st(0) = x * log2(e), st(1) = 1, st(2) = x * log2(e)
fprem                   ; st(0) = rem, st(1) = 1, st(2) = x * log2(e)
f2xm1                   ; st(0) = 2^(rem) - 1, st(1) = 1, st(2) = x * log2(e)
faddp   st(1), st(0)    ; st(0) = 2^(rem), st(1) = x * log2(e)
fscale                  ; st(0) = 2^(rem) * 2^(trunc(x * log2(e))) = e^x
fstp    st(1)           ; pop the leftover x * log2(e); result in st(0)

This multiplies x by log2(e), takes the remainder of that value after division by 1, raises 2 to the power of that remainder (f2xm1 also subtracts 1, so we add the 1 back in), and finally scales the result by 2 raised to the truncated (integer) part of x × log2(e).
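
As a usage illustration, the sequence might be wrapped as a small routine along these lines. This is only a sketch: it assumes NASM-style syntax, a 32-bit float x in memory (exp_x87 and x are placeholder names), and that the caller expects the result in st(0), as the usual 32-bit calling conventions do for floating-point return values:

exp_x87:
        fld     dword [x]       ; st(0) = x, promoted to extended precision
        fldl2e
        fmulp   st(1), st(0)
        fld1
        fld     st(1)
        fprem
        f2xm1
        faddp   st(1), st(0)
        fscale
        fstp    st(1)           ; e^x is left in st(0)
        ret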


An alternative implementation—essentially that suggested by fuz and reminiscent of the code generated by MSVC for the exp function from the C standard library—is the following:

; st(0) == x
fldl2e                       ; st(0) = log2(e), st(1) = x
fmulp    st(1), st(0)        ; st(0) = x * log2(e)
fld      st(0)               ; st(0) = st(1) = x * log2(e)
frndint                      ; st(0) = round(x * log2(e))   <-- call this "int"
fsub     st(1), st(0)        ; st(1) = x * log2(e) - int    <-- call this "frac"
fxch                         ; swap st(0) and st(1); pairable with FSUB on Pentium (P5)
f2xm1                        ; st(0) = 2^(frac) - 1, st(1) = int
fld1                         ; st(0) = 1
faddp    st(1), st(0)        ; st(0) = 2^(frac), st(1) = int
fscale                       ; st(0) = 2^(frac) * 2^(int) = e^x
fstp     st(1)               ; pop the leftover "int"; result in st(0)

The primary difference is the use of frndint and fsub to get a value in the range −1.0 to +1.0, as required by f2xm1, instead of using fprem to get the remainder after a division by 1.

To get an idea of the relative costs of these instructions, we'll pull data from Agner Fog's instruction tables. A "?" indicates that the corresponding data is unavailable.

Instruction      AMD K7      AMD K8      AMD K10    Bulldozer    Piledriver    Ryzen
-------------------------------------------------------------------------------------------
FLD/FLD1/FLDL2E  [all very fast, 1-cycle instructions, with a reciprocal throughput of 1]
FADD(P)/FSUB(P)  1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FMUL(P)          1/4/1       1/4/1       1/4/1      1/5-6/1      1/5-6/1       1/5/1
FPREM            1/7-10/8    1/7-10/8    1/?/7      1/19-62/?    1/17-60/?     2/?/12-50
FRNDINT          5/10/3      5/10/3      6/?/37     1/4/1        1/4/1         1/4/3
FSCALE           5/8/?       5/8/?       5/9/29     8/52/?       8/44/5        8/9/4
F2XM1            8/27/?      53/126/?    8/65/30?   10/64-71/?   10/60-73/?    10/50/?

The notation used above is "ops/latency/reciprocal throughput".

Instruction      8087     287      387      486      P5         P6        PM      Nehalem
-------------------------------------------------------------------------------------------
FLD              17-22    17-22    14       4        1/0        1/?       1/1      1/1
FLD1             15-21    15-21    24       4        2/0        2/?       2/?      2/?
FLDL2E           15-21    15-21    40       8        5/2        2/?       2/?      2/?
FADD(P)/FSUB(P)  70-100   70-100   23-34    8-20     3/2        1/3       1/3      1/3
FMUL(P)          90-145   90-145   29-57    16       3/2        1/5       1/5      1/5
FPREM            15-190   15-190   74-155   70-138   16-64 (2)  23/?      26/37    25/14
FRNDINT          16-50    16-50    66-80    21-30    9-20 (0)   30/?      15/19    17/22
FSCALE           32-38    32-38    67-86    30-32    20-32 (5)  56/?      28/43    24/12
F2XM1            310-630  310-630  211-476  140-279  13-57 (2)  17-48/66  ~20/?    19/58

For pre-P5, the value is just cycle counts. For P5, the notation is cycle counts, with the parenthetical indicating overlap with other FP instructions. For P6 and later, the notation is µops/latency.

Clearly, f2xm1 is a slow, expensive instruction, but it is used by both implementations and is hard to avoid. As slow as it is, it's still far faster than implementing this as a loop.

(One possible way around the slow f2xm1 instruction, if you are willing to sacrifice code size for speed, would be a look-up-table-based approach. You can find several papers published on this if you search, although most of them are behind paywalls. :-( Here's one publicly-accessible reference. In fact, this is probably what an optimized math library written in C would do for its implementation of exp. Ben Voigt wrote similar code in C#. The basic question here is, as always: are you optimizing for size or for speed? In other words, are you calling this function frequently and need it to be as fast as possible? Or are you calling it infrequently and just need it to be precise, without consuming too much space in the binary and potentially slowing down other code on the hot path by evicting it from the cache?)

But what about fprem vs. frndint?

Well, on AMD CPUs, fprem pretty consistently decodes to fewer operations than frndint, although frndint is getting faster than fprem on modern AMD CPUs. Then again, I'd say that's irrelevant, because on those CPUs, you'd be writing SSE2/AVX code, not using the legacy x87 FPU. (These instructions also get slower on later model Intel microarchitectures, like Sandy Bridge, Haswell, etc., but the same argument applies there, so I've ignored those altogether in compiling the data.) What really matters is how this code performs on older CPUs, and on older CPUs, the version that uses fprem looks like a clear winner.

On Intel CPUs, however, the reverse appears to be true: frndint is generally faster than fprem, except on P6 (Pentium Pro, Pentium II, and Pentium III).

But we're not just talking about frndint vs. fprem; we're actually talking about frndint+fsub vs. fprem. fsub is a relatively fast instruction, but it starts to get difficult to predict just what performance this code will have without executing it and timing it. Cycle counts can only tell us so much, and the fprem sequence is shorter overall (fewer instructions and, more importantly, fewer bytes: 18 vs. 22), which can mean substantial speed improvements.

Either implementation is fine if you don't want to bother benchmarking, or if you want something generic that runs well on all CPUs. Neither of them is going to set the world on fire in terms of performance, but, as I said before, both will be significantly faster than a loop that tries to calculate a Taylor series. The overhead there is what will kill your performance.

answered Mar 19 '23 by Cody Gray