Efficient faithfully-rounded implementation of error function erff()

Question

The error function is closely related to the standard normal distribution and occurs frequently in the natural sciences as well as other fields. It is used in finance when pricing options, for example. While support for it was added first to ISO C99 and subsequently to C++ in the form of the functions erf(), erff(), it was until recently absent from at least one popular C/C+ toolchain. Many projects still use their own implementations of the error function, often based on approximations from older literature, such as Abramowitz and Stegun, which in turn go back to

Cecil Hastings Jr, "Approximations for Digital Computers". Princeton University Press, 1955

In modern computing, faithfully-rounded implementations of transcendental functions are often seen as a minimal standard of accuracy for math libraries; such a standard still allows for high-performance implementations. A function is called faithfully rounded when it returns results with a maximum error of less than 1 ulp compared with the mathematical value across the entire input domain. Older published algorithms do not provide faithfully-rounded results when implemented with exclusive use of IEEE-754 single precision operations.

Modern computer hardware provides an floating-point operation called fused multiply-add (or FMA for short) that computes a floating-point multiplication followed by a dependent floating-point addition such that the full unrounded product is used in the addition, and only a single rounding occurs at the end of the operation. This fused operation, introduced by IBM in 1990, provides higher accuracy and higher performance in many computations. It is available on the two most prevalent CPU architectures today (ARM and x86) as well as on GPUs. It has been exposed in C and C++ via the fmaf() and fmaf() functions.

Assuming that FMA is natively supported by hardware, how would one construct a single-precision error function erff() which is both faithfully rounded and efficient? Preferrably, the code should be vectorizable, possibly after minor code modification.

njuffa · Accepted Answer

Looking at the graph of the error function, one observes that the function is symmetric about the origin; approximations can therefore trivially be restricted to the positive half-plane. Furthermore, the graph could be split into two segments, with the boundary between these somewhere near x = 1. In the segment closer to the origin, the error function is fairly linear and vertically dominated, while in the segment further away from the origin it is horizontally dominated and asymptotically approaches unity in exponentially decaying fashion.

A reasonable conclusion is that a simple polynomial approximation x * p (x) is suitable for the segment close to zero, while the other segment would be well approximated by 1 - exp (x * q (x)), where q (x) is a second polynomial approximation. Based on the Taylor series expansion of the error function, the approximation for the segment near the origin should actually be of the form x * p (x²).

The first task then is to find the switch-over point between the two segments. I used an experimental approach for this, starting with the switch-over point at 0.875 and incrementally marching it towards 1.0. For each value of the switch-over point, I generated an initial minimax approximation to the error function between zero and the switch-over point with the Remez algorithm. These were then further improved, in terms of accuracy, using search heuristics for the coefficient values based on evaluation of the polynomial via Horner scheme with use of FMA operations. Incrementing the switch-over point was repeated until the maximum error of the resulting approximation exceeded 1 ulp. By this process I determined the optimal boundary between the two approximation segments as 0.921875. This results in an approximation x * p (x²) that achieves a maximum error of barely less than 1 ulp.

The Remez algorithm was also used to provide an initial minimax computation for the polynomial q (x). It became clear early on that there is a fair amount of interaction between q (x) and the approximation used internally to exp() in the way their errors reinforce or compensate each other. This means the best choice for the coefficients of q (x) will be tightly tied to the implementation of exp() and must be taken into account as part of the heuristic refinement of the initial set of coefficients. I therefore decided to use my own implementation of expf() so as to isolate myself from any particular library implementation. As a minimum requirement, expf() itself needs to be faithfully rounded, and probably must conform to a slightly tighter error bound for this approach to work, although I did not try to determine exactly how tight. In this case my own implementation of expf() provides an error bound of 0.87161 ulps, which turned out to be sufficient.

Since the segment requiring use of exp () is the slow path, I chose to use Estrin's scheme for the low-order terms of the polynomial q (x) in order to increase instruction-level parallelism and thus performance. The impact on accuracy of doing so is negligible. For accuracy reasons, the Horner scheme must be utilized for the high-order terms of the polynomial. Looking at the least-significant coefficients of both polynomials, one observes that both are 1.128..., and we can therefore improve accuracy slightly by splitting the coefficient into (1 + 0.128...), which facilitates the use of FMA to perform the final multiplication with x.

In the end I was able to achieve an implementation of erff() where each of the two code paths achieves a maximum error of just under 1 ulp, as established by exhaustive test against a higher-precision reference. The function is therefore faithfully rounded. Use of FMA is a crucial component of this success. Depending on toolchain, the C99 code shown below may be vectorizable as-is, or one could modify it manually such that both code paths are computed concurrently with a selection of the desired result at the end. High-performance math libraries include a vectorizable version of expf() which should be used instead of my custom function my_expf(). Not all vectorized implementations of expf() offer sufficient accuracy though, and for others tuning of the coefficients in polynomial q(x) will be necessary.

If a custom version of expf() is used, like I did here, one would probably want to replace the call to ldexpf() with faster machine-specific code for performance reasons.

/* Compute exponential base e. Maximum ulp error = 0.87161 */
float my_expf (float a)
{
    float c, f, r;
    int i;

    // exp(a) = exp(i + f); i = rint (a / log(2)) 
    c = 0x1.800000p+23f; // 1.25829120e+7
    r = fmaf (0x1.715476p+0f, a, c) - c; // 1.44269502e+0
    f = fmaf (r, -0x1.62e400p-01f, a); // -6.93145752e-1 // log_2_hi 
    f = fmaf (r, -0x1.7f7d1cp-20f, f); // -1.42860677e-6 // log_2_lo
    i = (int)r;
    // approximate r = exp(f) on interval [-log(2)/2,+log(2)/2]
    r =             0x1.6a98dap-10f;  // 1.38319808e-3
    r = fmaf (r, f, 0x1.1272cap-07f); // 8.37550033e-3
    r = fmaf (r, f, 0x1.555a20p-05f); // 4.16689515e-2
    r = fmaf (r, f, 0x1.55542ep-03f); // 1.66664466e-1
    r = fmaf (r, f, 0x1.fffff6p-02f); // 4.99999851e-1
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    r = fmaf (r, f, 0x1.000000p+00f); // 1.00000000e+0
    // exp(a) = 2**i * exp(f);
    r = ldexpf (r, i);
    // handle special cases
    if (!(fabsf (a) < 104.0f)) {
        r = a + a; // handle NaNs
        if (a < 0.0f) r = 0.0f;
        if (a > 0.0f) r = 1e38f * 1e38f; // + INF
    }
    return r;
}

/* compute error function. max ulp error = 0.99993 */
float my_erff (float a)
{
    float r, s, t, u;

    t = fabsf (a);
    s = a * a;
    if (t > 0.921875f) { // 0.99527 ulp
        r = fmaf (0x1.222900p-16f, t, -0x1.91d2ccp-12f); // 1.72948930e-5, -3.83208680e-4
        u = fmaf (0x1.fd1336p-09f, t, -0x1.8d6300p-06f); // 3.88393435e-3, -2.42545605e-2
        r = fmaf (r, s, u);
        r = fmaf (r, t, 0x1.b55cb0p-4f); // 1.06777847e-1
        r = fmaf (r, t, 0x1.450aa0p-1f); // 6.34846687e-1
        r = fmaf (r, t, 0x1.079d0cp-3f); // 1.28717512e-1
        r = fmaf (r, t, t);
        r = my_expf (-r);
        r = 1.0f - r;
        r = copysignf (r, a);
    } else { // 0.99993 ulp
        r =             -0x1.3a1a82p-11f;  // -5.99104969e-4
        r = fmaf (r, s,  0x1.473f48p-08f); //  4.99339588e-3
        r = fmaf (r, s, -0x1.b68bd2p-06f); // -2.67667342e-2
        r = fmaf (r, s,  0x1.ce1a46p-04f); //  1.12818025e-1
        r = fmaf (r, s, -0x1.8126e0p-02f); // -3.76124859e-1
        r = fmaf (r, s,  0x1.06eba6p-03f); //  1.28379151e-1
        r = fmaf (r, a, a);
    } 
    return r;
}

Efficient faithfully-rounded implementation of error function erff()

Tags:

c

floating-point

njuffa

1 Answers

njuffa

Recent Activity

Donate For Us

Efficient faithfully-rounded implementation of error function erff()

Tags:

c

floating-point

njuffa

1 Answers

njuffa

Related questions

Recent Activity

Donate For Us