
Fast 1/X division (reciprocal)

Is there some way to improve reciprocal (division 1 over X) with respect to speed, if the precision is not crucial?

So, I need to calculate 1/X. Is there some workaround so I lose precision but do it faster?

asked Mar 30 '12 by klm123

4 Answers

Here's How To Approximate More Efficiently

I believe that what the asker was looking for is a more efficient way to approximate 1.0/x, not some technical definition of approximation under which 1 would count as a (very imprecise) answer. I also believe that this satisfies that.

#ifdef __cplusplus
    #include <cstdint>
#else
    #include <stdint.h>
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl;
        #ifdef __cplusplus
            std::uint_least64_t ull;
        #else
            uint_least64_t ull;
        #endif
    } u;
    u.dbl = x;
    u.ull = ( 0xbfcdd6a18f6a6f52ULL - u.ull ) >> 1;
                                // pow( x, -0.5 )
    u.dbl *= u.dbl;             // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.dbl;
}
// Note: in plain C these two functions would need distinct names, since C has no overloading.
__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float single;
        #ifdef __cplusplus
            std::uint_least32_t uint;
        #else
            uint_least32_t uint;
        #endif
    } u;
    u.single = x;
    u.uint = ( 0xbe6eb3beU - u.uint ) >> 1;
                                // pow( x, -0.5 )
    u.single *= u.single;       // pow( pow(x,-0.5), 2 ) = pow( x, -1 ) = 1.0 / x
    return u.single;
}

Hmm....... I wonder if the CPU manufacturers knew you could approximate the reciprocal with only a single multiply, subtraction, and bit-shift when they designed the CPU.... hmm.........

As for benchmarking, the hardware squaring (multiply) instruction combined with the hardware subtraction instruction is just as fast as the hardware 1.0/x instruction on modern-day computers (my benchmarks were on an Intel i7, but I would assume similar results for other processors). However, if this algorithm were implemented in hardware as a new assembly instruction, then the increase in speed would probably be good enough for it to be quite practical.

For more information about this method, this implementation is based off the wonderful "fast" inverse square root algorithm.
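To sanity-check the approximation's accuracy, it helps to have a standalone, type-punning-safe copy to compare against exact division (a sketch using memcpy instead of a union; `approx_reciprocal` is my name, not part of the answer's code):

```cpp
#include <cstdint>
#include <cstring>

// Standalone copy of the float approximation above; memcpy sidesteps
// the union type-punning question discussed below.
static float approx_reciprocal(float x) {
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof u);
    u = (0xbe6eb3beU - u) >> 1;   // ~ pow(x, -0.5)
    float y;
    std::memcpy(&y, &u, sizeof y);
    return y * y;                 // squaring gives ~ 1/x
}
```

Comparing `approx_reciprocal(x)` against `1.0f / x` over a range of inputs shows a worst-case relative error of a few percent, which is the precision trade-off the question asks about.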

As Pharap brought to my attention, reading an inactive member of a union is undefined behavior in C++, so here are two possible solutions I have devised from his helpful comments to avoid it. The first seems more like an unsavory trick around a language semantic, and is practically no better than the original solution.

#ifdef __cplusplus
    #include <cstdint>
#else
    #include <stdint.h>
#endif
__inline__ double __attribute__((const)) reciprocal( double x ) {
    union {
        double dbl[2];
        #ifdef __cplusplus
            std::uint_least64_t ull[2];
        #else
            uint_least64_t ull[2];
        #endif
    } u;
    u.dbl[0] = x; // dbl is now the active property, so only dbl can be read now
    u.ull[1] = 0;//trick to set ull to the active property so that ull can be read
    u.ull[0] = ( 0xbfcdd6a18f6a6f52ULL - u.ull[0] ) >> 1;
    u.dbl[1] = 0; // now set dbl to the active property so that it can be read
    u.dbl[0] *= u.dbl[0];
    return u.dbl[0];
}
__inline__ float __attribute__((const)) reciprocal( float x ) {
    union {
        float flt[2];
        #ifdef __cplusplus
            std::uint_least32_t uint[2];
        #else
            uint_least32_t uint[2];
        #endif
    } u;
    u.flt[0] = x; // now flt is active
    u.uint[1] = 0; // set uint to be active for reading and writing
    u.uint[0] = ( 0xbe6eb3beU - u.uint[0] ) >> 1;
    u.flt[1] = 0; // set flt to be active for reading and writing
    u.flt[0] *= u.flt[0];
    return u.flt[0];
}

The second possible solution is much more palatable because it gets rid of unions altogether. However, it will be much slower if the compiler doesn't optimize it well. On the upside, the solution below is completely byte-order-agnostic, provided:

  1. that bytes are 8-bits in width
  2. that bytes are the smallest atomic unit on the target machine.
  3. that doubles are 8-bytes wide and floats are 4-bytes wide.

#ifdef __cplusplus
    #include <cstdint>
    #include <cstring>
    #define stdIntWithEightBits std::uint8_t
    #define stdIntSizeOfFloat std::uint32_t
    #define stdIntSizeOfDouble std::uint64_t
#else
    #include <stdint.h>
    #include <string.h>
    #define stdIntWithEightBits uint8_t
    #define stdIntSizeOfFloat uint32_t
    #define stdIntSizeOfDouble uint64_t
#endif

__inline__ double __attribute__((const)) reciprocal( double x ) {
    double byteIndexFloat = 1.1212798184631136e-308; // bytes hold the shift amounts 0x00 08 10 18 20 28 30 38
    stdIntWithEightBits* byteIndexs = (stdIntWithEightBits*)&byteIndexFloat;
    
    stdIntWithEightBits* inputBytes = (stdIntWithEightBits*)&x;
    
    // Cast each byte up front so the shifts happen at 64-bit width
    // instead of overflowing a promoted int.
    stdIntSizeOfDouble inputAsUll = (
        ((stdIntSizeOfDouble)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfDouble)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfDouble)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfDouble)inputBytes[3] << byteIndexs[3]) |
        ((stdIntSizeOfDouble)inputBytes[4] << byteIndexs[4]) |
        ((stdIntSizeOfDouble)inputBytes[5] << byteIndexs[5]) |
        ((stdIntSizeOfDouble)inputBytes[6] << byteIndexs[6]) |
        ((stdIntSizeOfDouble)inputBytes[7] << byteIndexs[7])
    );
    inputAsUll = ( 0xbfcdd6a18f6a6f52ULL - inputAsUll ) >> 1;
    
    double outputDouble;
    
    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[0]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[1]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[2]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[3]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[4]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[5]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[6]),
        (stdIntWithEightBits)(inputAsUll >> byteIndexs[7])
    };
    memcpy(&outputDouble, &outputBytes, 8);
    
    return outputDouble * outputDouble;
}

__inline__ float __attribute__((const)) reciprocal( float x ) {
    float byteIndexFloat = 7.40457e-40; // bytes hold the shift amounts 0x00 08 10 18
    stdIntWithEightBits* byteIndexs = (stdIntWithEightBits*)&byteIndexFloat;
    
    stdIntWithEightBits* inputBytes = (stdIntWithEightBits*)&x;
    
    // Cast each byte up so the shifts happen at 32-bit unsigned width.
    stdIntSizeOfFloat inputAsInt = (
        ((stdIntSizeOfFloat)inputBytes[0] << byteIndexs[0]) |
        ((stdIntSizeOfFloat)inputBytes[1] << byteIndexs[1]) |
        ((stdIntSizeOfFloat)inputBytes[2] << byteIndexs[2]) |
        ((stdIntSizeOfFloat)inputBytes[3] << byteIndexs[3])
    );
    inputAsInt = ( 0xbe6eb3beU - inputAsInt ) >> 1;
    
    float outputFloat;
    
    const stdIntWithEightBits outputBytes[] = {
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[0]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[1]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[2]),
        (stdIntWithEightBits)(inputAsInt >> byteIndexs[3])
    };
    memcpy(&outputFloat, &outputBytes, 4);
    
    return outputFloat * outputFloat;
}

Disclaimer: please note that I am more of a novice in C++. As such, I welcome with wide-open arms any best-practices, proper-formatting, or clarity edits, both to improve the quality of this answer for all who read it and to expand my knowledge of C++ for all my years to come.

answered Nov 01 '22 by Jack G

First, make sure this isn't a case of premature optimization. Do you know that this is your bottleneck?

As Mystical says, 1/x can be calculated very quickly. Make sure you're not using the double datatype for either the 1 or the divisor; floats are much faster.

That said, benchmark, benchmark, benchmark. Don't waste your time spending hours on numerical theory just to discover the source of the poor performance is IO access.
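On x86, that fast hardware reciprocal is directly accessible: the RCPSS instruction produces a roughly 12-bit approximation of 1/x, which one Newton-Raphson step can refine (a sketch using the SSE intrinsics; x86-only, and `fast_reciprocal` is an illustrative name):

```cpp
#include <xmmintrin.h>  // SSE intrinsics (x86 only)

// Approximate 1/x with the hardware RCPSS instruction (~12-bit accuracy),
// then refine with one Newton-Raphson step: y' = y * (2 - x*y),
// which roughly doubles the number of accurate bits.
static float fast_reciprocal(float x) {
    float y = _mm_cvtss_f32(_mm_rcp_ss(_mm_set_ss(x)));
    return y * (2.0f - x * y);
}
```

Compilers will often emit this same pattern for `1.0f / x` on their own under flags like `-ffast-math`, which is another reason to benchmark before hand-optimizing.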

answered Nov 01 '22 by Mahmoud Al-Qudsi


First of all, if you turn on compiler optimizations, the compiler is likely to optimize the calculation if possible (to pull it out of a loop, for example). To see this optimization, you need to build and run in Release mode.

Division may be heavier than multiplication (though a commenter pointed out that reciprocals are just as fast as multiplication on modern CPUs, in which case this doesn't apply to your case), so if 1/X appears inside a loop (and more than once), you can help by computing the result once before the loop (float Y = 1.0f/X;) and then using Y. (The compiler's optimizer might do this in any case.)
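Applied to a loop, that caching looks like this (a trivial sketch with made-up names):

```cpp
#include <cstddef>

// Hypothetical example: scale an array by 1/X. Computing the reciprocal
// once before the loop replaces N divisions with N multiplications.
void scale_by_reciprocal(float* data, std::size_t n, float X) {
    const float Y = 1.0f / X;    // one division, hoisted out of the loop
    for (std::size_t i = 0; i < n; ++i)
        data[i] *= Y;            // multiply is cheaper than divide
}
```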

Also, certain formulas can be redesigned to remove division or other inefficient computations. For that, you could post the larger computation being performed. Even there, the program or algorithm itself can sometimes be restructured to prevent time-consuming loops from being hit as frequently.

How much accuracy can be sacrificed? If on the off chance you only need an order of magnitude, you can get that easily using the modulus operator or bitwise operations.
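For an order-of-magnitude reciprocal, negating the binary exponent is enough; the standard `ilogb`/`ldexp` functions do exactly that without touching the mantissa (a sketch of that idea, not something from the answer itself):

```cpp
#include <cmath>

// Order-of-magnitude reciprocal: approximate 1/x by 2^(-ilogb(x)),
// which is correct to within a factor of 2 for any normal positive x.
double rough_reciprocal(double x) {
    return std::ldexp(1.0, -std::ilogb(x));
}
```

On typical hardware this compiles down to exponent-field manipulation, with no divide at all.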

However, in general, there's no way to speed up division. If there were, compilers would already be doing it.

answered Nov 01 '22 by Dan Nissenbaum


I've bench-tested these methods on an Arduino NANO for speed and 'accuracy'.
The basic calculation was to set up variables Y = 15,000,000 and Z = 65,535
(in my real case, Y is a constant and Z can vary between 65,535 and 3,000, so a useful test).
Calc time on the Arduino was established by putting a pin low, then high as the calc was made, then low again, and comparing on-times with a logic analyser, over 100 cycles. With variables as unsigned integers:

Y * Z takes 0.231 msec
Y / Z takes 3.867 msec

With variables as floats:

Y * Z takes 1.066 msec
Y / Z takes 4.113 msec

Basic benchmark: 15,000,000 / 65,535 = 228.885 via calculator.

Using {Jack G’s} float reciprocal algorithm:

Y * reciprocal(Z) takes 1.937 msec, which is a good improvement, but the accuracy suffers: 213.68.

Using {nimig18’s} float inv_fast:

Y* inv_fast(Z)  takes  5.501 msec  accuracy 228.116  with single iteration  
Y* inv_fast(Z)  takes  7.895 msec  accuracy 228.883  with second iteration 

Using Wikipedia's Q_rsqrt (pointed to by {Jack G})

Y * Q_rsqrt(Z) takes 6.104 msec, accuracy 228.116, with a single iteration.
All entertaining but ultimately disappointing!
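The "iterations" in these timings refer to Newton-Raphson refinement: each step y ← y·(2 − Z·y) roughly doubles the number of accurate bits of a reciprocal estimate, at the cost of one extra multiply-subtract-multiply. A generic sketch of that step (not nimig18's exact `inv_fast`):

```cpp
// Newton-Raphson refinement for the reciprocal: given an estimate y ~ 1/z,
// each step y = y * (2 - z*y) squares the relative error, so it converges
// very fast as long as the initial estimate satisfies |1 - z*y| < 1.
float refine_reciprocal(float z, float y, int steps) {
    for (int i = 0; i < steps; ++i)
        y = y * (2.0f - z * y);
    return y;
}
```

Starting from a crude bit-trick guess like the ones above, two steps typically reach near-full float precision, which is why the second-iteration figures above are so much closer to 228.885.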

answered Nov 01 '22 by NT4Boy