 

Very fast approximate Logarithm (natural log) function in C++?

There are various tricks to replace std::sqrt (Timing Square Root) and some for std::exp (Using Faster Exponential Approximation), but I can find nothing to replace std::log.

It's part of loops in my program and it's called multiple times, and while exp and sqrt were optimized, Intel VTune now suggests that I optimize std::log. After that, it seems that only my design choices will be limiting.

For now I use a 3rd-order Taylor approximation of ln(1+x) with x between -0.5 and +0.5 (90% of the cases, with a max error of 4%) and fall back to std::log otherwise. This gave me a 15% speed-up.
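For reference, a minimal sketch of the approach described above might look like the following (the function name approx_log and the exact cutoff handling are illustrative, not taken from the question; the coefficients are simply those of the Taylor series x - x^2/2 + x^3/3):

#include <cmath>

// Hypothetical sketch: 3rd-order Taylor series of ln(1+x) around x = 0,
// used when the argument is close to 1, with std::log as the fallback.
inline float approx_log (float a)
{
    float x = a - 1.0f;                                       // ln(a) = ln(1 + x)
    if (x > -0.5f && x < 0.5f) {
        return x - 0.5f * x * x + (1.0f / 3.0f) * x * x * x;  // x - x^2/2 + x^3/3
    }
    return std::log (a);                                      // fall back far from 1
}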

asked Oct 02 '16 by qwark


1 Answer

Prior to embarking on the design and deployment of a customized implementation of a transcendental function for performance, it is highly advisable to pursue optimizations at the algorithmic level, as well as through the toolchain. Unfortunately, we do not have any information about the code to be optimized here, nor do we have information on the toolchain.

At the algorithmic level, check whether all calls to transcendental functions are truly necessary. Maybe there is a mathematical transformation that requires fewer function calls, or converts transcendental functions into algebraic operations. Are any of the transcendental function calls possibly redundant, e.g. because the computation is unnecessarily switching in and out of logarithmic space? If the accuracy requirements are modest, can the whole computation be performed in single precision, using float instead of double throughout? On most hardware platforms, avoiding double computation can lead to a significant performance boost.
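For illustration only (the function names here are hypothetical, not from the question), one such transformation replaces a sum of logarithms by the logarithm of a product, so that many calls to std::log collapse into a few:

#include <cmath>
#include <vector>

// naive version: one std::log call per element
double sum_of_logs_naive (const std::vector<double>& v)
{
    double s = 0.0;
    for (double x : v) s += std::log (x);
    return s;
}

// transformed version: one std::log call per block of 16 elements;
// blocking the running product limits the risk of overflow/underflow
double sum_of_logs_blocked (const std::vector<double>& v)
{
    double s = 0.0, p = 1.0;
    int n = 0;
    for (double x : v) {
        p *= x;
        if (++n == 16) { s += std::log (p); p = 1.0; n = 0; }
    }
    return s + std::log (p);
}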

Compilers tend to offer a variety of switches that affect the performance of numerically intensive code. In addition to increasing the general optimization level to -O3, there is often a way to turn off denormal support, i.e. turn on flush-to-zero, or FTZ, mode. This has performance benefits on various hardware platforms. In addition, there is often a "fast math" flag whose use results in slightly reduced accuracy and eliminates overhead for handling special cases such as NaNs and infinities, plus the handling of errno. Some compilers also support auto-vectorization of code and ship with a SIMD math library, for example the Intel compiler.
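For example (exact spellings vary by compiler and target, and fast-math changes floating-point semantics, so treat this as a starting point rather than a recommendation), a GCC or Clang build line enabling higher optimization, fast-math, and FMA-capable code generation might look like:

g++ -O3 -ffast-math -march=native -o app app.cpp
clang++ -O3 -ffast-math -march=native -o app app.cpp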

A custom implementation of a logarithm function typically involves separating a binary floating-point argument x into an exponent e and a mantissa m, such that x = m * 2^e, therefore log(x) = log(2) * e + log(m); for example, 10 = 1.25 * 2^3, so log(10) = 3 * log(2) + log(1.25). m is chosen such that it is close to unity, because this provides for efficient approximations, for example log(m) = log(1+f) = log1p(f) by minimax polynomial approximation.

C++ provides the frexp() function to separate a floating-point operand into mantissa and exponent, but in practice one typically uses faster machine-specific methods that manipulate floating-point data at the bit level by re-interpreting them as same-size integers. The code below for the single-precision logarithm, logf(), demonstrates both variants. The functions __int_as_float() and __float_as_int() provide for the reinterpretation of an int32_t into an IEEE-754 binary32 floating-point number and vice versa. This code relies heavily on the fused multiply-add operation, FMA, which is supported directly in hardware on most current processors, CPU or GPU. On platforms where fmaf() maps to software emulation, this code will be unacceptably slow.

#include <cmath>
#include <cstdint>
#include <cstring>

float __int_as_float (int32_t a) { float r; memcpy (&r, &a, sizeof r); return r;}
int32_t __float_as_int (float a) { int32_t r; memcpy (&r, &a, sizeof r); return r;}

/* compute natural logarithm, maximum error 0.85089 ulps */
float my_logf (float a)
{
    float i, m, r, s, t;
    int e;

#if PORTABLE
    m = frexpf (a, &e);
    if (m < 0.666666667f) {
        m = m + m;
        e = e - 1;
    }
    i = (float)e;
#else // PORTABLE
    i = 0.0f;
    if (a < 1.175494351e-38f){ // 0x1.0p-126
        a = a * 8388608.0f; // 0x1.0p+23
        i = -23.0f;
    }
    e = (__float_as_int (a) - __float_as_int (0.666666667f)) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = fmaf ((float)e, 1.19209290e-7f, i); // 0x1.0p-23
#endif // PORTABLE
    /* m in [2/3, 4/3] */
    m = m - 1.0f;
    s = m * m;
    /* Compute log1p(m) for m in [-1/3, 1/3] */
    r =             -0.130310059f;  // -0x1.0ae000p-3
    t =              0.140869141f;  //  0x1.208000p-3
    r = fmaf (r, s, -0.121483512f); // -0x1.f198b2p-4
    t = fmaf (t, s,  0.139814854f); //  0x1.1e5740p-3
    r = fmaf (r, s, -0.166846126f); // -0x1.55b36cp-3
    t = fmaf (t, s,  0.200120345f); //  0x1.99d8b2p-3
    r = fmaf (r, s, -0.249996200f); // -0x1.fffe02p-3
    r = fmaf (t, m, r);
    r = fmaf (r, m,  0.333331972f); //  0x1.5554fap-2
    r = fmaf (r, m, -0.500000000f); // -0x1.000000p-1  
    r = fmaf (r, s, m);
    r = fmaf (i,  0.693147182f, r); //  0x1.62e430p-1 // log(2)
    if (!((a > 0.0f) && (a < INFINITY))) {
        r = a + a;  // silence NaNs if necessary
        if (a  < 0.0f) r = INFINITY - INFINITY; //  NaN
        if (a == 0.0f) r = -INFINITY;
    }
    return r;
}

As noted in the code comment, the implementation above provides faithfully-rounded single-precision results, and it deals with exceptional cases consistent with the IEEE-754 floating-point standard. The performance can be increased further by eliminating special-case support, eliminating the support for denormal arguments, and reducing the accuracy. This leads to the following exemplary variant:

/* natural log on [0x1.f7a5ecp-127, 0x1.fffffep127]. Maximum relative error 9.4529e-5 */
float my_faster_logf (float a)
{
    float m, r, s, t, i, f;
    int32_t e;

    e = (__float_as_int (a) - 0x3f2aaaab) & 0xff800000;
    m = __int_as_float (__float_as_int (a) - e);
    i = (float)e * 1.19209290e-7f; // 0x1.0p-23
    /* m in [2/3, 4/3] */
    f = m - 1.0f;
    s = f * f;
    /* Compute log1p(f) for f in [-1/3, 1/3] */
    r = fmaf (0.230836749f, f, -0.279208571f); // 0x1.d8c0f0p-3, -0x1.1de8dap-2
    t = fmaf (0.331826031f, f, -0.498910338f); // 0x1.53ca34p-2, -0x1.fee25ap-2
    r = fmaf (r, s, t);
    r = fmaf (r, s, f);
    r = fmaf (i, 0.693147182f, r); // 0x1.62e430p-1 // log(2) 
    return r;
}
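A minimal test harness (the sample values are arbitrary, chosen here only for illustration) that exercises both variants next to std::log, assuming the definitions above are in the same translation unit, could look like this:

#include <cstdio>

int main (void)
{
    const float samples[] = { 1.0e-6f, 0.5f, 1.0f, 2.7182818f, 10.0f, 1.0e6f };
    for (float x : samples) {
        std::printf ("x=%14.6e  std::log=%15.8e  my_logf=%15.8e  my_faster_logf=%15.8e\n",
                     x, std::log (x), my_logf (x), my_faster_logf (x));
    }
    return 0;
}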
answered Oct 13 '22 by njuffa