EDIT:
Goal:
Generate a ubiquitous method for deriving a custom power function that outperforms the built-in pow(double, uint) by reusing precalculated/cached powers from power calculations on common variables.
What's already been done:
I've already derived such a function that's roughly 40% faster than the built-in; however, this is a brute-force, hand-derived function -- I want a method for auto-generating such a power function block for an arbitrary uint power.
KNOWNS
To derive an optimal custom pow(double, uint) you need some knowns. For this question the knowns (to clarify) are: N_MAX (the maximum exponent n that will ever be requested) and the precalculated powers on hand (r2, r4, and r6). r2 can be assumed to always have been calculated regardless of the other precalculated powers.
SOLUTION REQUIREMENTS
An optimal solution requiring a separate program to write a case lookup table, or preprocessor logic to generate such a table, is acceptable; however, non-optimal solutions using hand-generated (i.e. brute-force derived) lookup tables built from the powers on hand will not be accepted (as I have that already and show it in my example... the idea is to get away from this).
POSSIBLE SOLUTION ROUTE
As a suggestion: you know N_MAX and a set of precalculated powers B (B={2,4,6} for my example). You can produce, either in a separate program or in the preprocessor, a table of all repeated squarings Sq(Bi, x) <= N_MAX. You can use this to form a basis set A, which you then search somehow to determine the least number of terms that can be summed to produce an arbitrary exponent n>>1, where n<=N_MAX (the shift is because we take care of the odd case by checking the LSB and multiplying by sqrt(r2)).
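As an illustration of the first step only (building the table of repeated squarings), here is a minimal sketch; N_MAX, B, and A follow the notation above, while the container choices and the printout are my own assumptions. Exponents are kept in units of the base itself (so e.g. 6 -> 12 -> 24 -> 48); the search over A for the cheapest sum of terms is a separate problem, addressed in the answer further down.

// Sketch: build the basis set A = { b * 2^k : b in B, b * 2^k <= N_MAX },
// i.e. every exponent reachable from a precalculated power by repeated squaring alone.
#include <cstdio>
#include <set>
#include <vector>

int main()
{
    const unsigned N_MAX = 50;                 // maximum exponent needed
    const std::vector<unsigned> B = {2, 4, 6}; // exponents already on hand (r2, r4, r6)

    std::set<unsigned> A;
    for (unsigned b : B)
        for (unsigned e = b; e <= N_MAX; e *= 2) // each squaring doubles the exponent
            A.insert(e);

    for (unsigned e : A)
        std::printf("%u ", e);                 // e.g. 2 4 6 8 12 16 24 32 48
    std::printf("\n");
}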
THEORETICAL BACKGROUND
I believe formally the method below is a modified version of exponentiation by squaring:
http://en.wikipedia.org/wiki/Exponentiation_by_squaring
... which takes advantage of the fact that certain lower-order powers are by necessity already precalculated, and hence shifts the optimal set of multiplications away from vanilla exponentiation by squaring (which I assume pow(double, int) uses). However, there are significant savings from using the stored small-power intermediates instead of simple exp. by squaring on r2.
THEORETICAL PERFORMANCE
For example, for one set of objects n=14. In this scenario plain exp. by squaring gives
double r4 = Sq(r2), r14=Sq(r4)*r4*r2; //4 op.
... which takes 4 FP multiplications... but using the r2 and r6 we have
double r14=Sq(r6)*r2; //2 op.
.... 2 FP multiplications.... in other words, by going from "dumb" exponentiation by squaring to my modified exp. by squaring using the common-exponent precaching, I've cut my cost of calculation by 50% in terms of multiplications... at least until memory costs are considered.
REAL PERFORMANCE
With my current method (compiled with gcc -O3) I get 35.1 s to run 1 million cycles of my program, versus (with no other modifications) 56.6 s using the built-in pow(double, int)... so almost the theoretical speedup.
At this point you may be scratching your head at how a 50% cut in multiplications on a single instruction line can deliver a ~40% speedup. But basically this line of code is called 1,000+ times per cycle and is by far the most evaluated/most expensive line of code in the entire program. Hence the program appears highly sensitive to a small optimization/improvement in this chunk.
ORIGINAL POST and EXAMPLE CODE
I need to replace the pow(double, int) function, as I have already calculated a 6th-power term and have 2nd- and 4th-power intermediates saved, all of which can be used to reduce multiplications in the second pow call, which uses the same double base.
More specifically, in my C++ code I have a performance-critical snippet where I raise the reciprocal of the distance between 3D points to the 6th power and to an nth power, e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(sqrt(r2), n) - r6);
Where m and a are constants related to the fitted equation and n is the arbitrary power.
A slightly more efficient form is:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(r2, n)*(n&0x1?sqrt(r2):1.0) - r6);
However, this is also not optimal. What I've found to be significantly faster is to have a custom pow function that uses the multiples r2, r4, and r6, which I have to calculate already anyway for the second term.
e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r4 = r2 * r2, r6 = r4 * r2;
results += m*(POW(r2, r4, r6, n) - r6);
Inside the function:
double POW(double r2, double r4, double r6, uint n)
{
    // Handle an odd exponent up front, then work in powers of r2.
    double result = (n & 0x1 ? sqrt(r2) : 1.0);
    n >>= 1;
    switch (n)
    {
    case 1:
        ....
    case 12:
        result *= Sq(Sq(r6));
        break;
    }
    return result;
}
The good thing is that my function appears fast in preliminary testing. The bad news is that it's not very ubiquitous and is very long, as I need case statements for int powers from 8 to 50 or so (potentially even higher in the future). Furthermore, for each case I had to examine and try different combinations to find, by brute-force derivation, which combination of r2, r4, and r6 yielded the fewest multiplications.
Does anyone have a more ubiquitous solution for a pow(double, int) replacement that uses precalculated powers of the base to cut the number of necessary multiplications, and/or a ubiquitous theory of how to determine the ideal combination that produces the fewest multiplications for an arbitrary n and some set of precalculated multiples?
Here there is no overhead to using std::pow compared to direct multiplication (x * x). It seems that most of the overhead of this function for single precision was in fact in the conversion to double, since the algorithm itself appears to be implemented only for double precision. For the third power, it is still faster to use x * x * x than std::pow(x, 3), but std::pow is only around 2.5 times slower. I've tested several versions of G++ (4.9.4, 5.4.0 and 6.4.0) and have not seen any significant difference in performance. Note that the cost of exponentiation by squaring is tied to the Hamming weight of the exponent; it can be viewed as a suboptimal addition-chain exponentiation algorithm, since it computes the exponent by an addition chain consisting of repeated exponent doublings (squarings) and/or incrementing exponents by one (multiplying by x) only. Overall, if you do not care much about extreme accuracy, you may consider using your own pow function for small-ish (integer) n values; after about n=100 it becomes more interesting to use std::pow.
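For reference, a micro-benchmark in this spirit could look like the following sketch; the iteration count, the volatile variables, and the timing harness are my own choices, not the setup used for the numbers quoted above.

// Minimal sketch of a timing comparison between x * x * x and std::pow(x, 3).
// volatile keeps the compiler from hoisting or discarding the work.
#include <chrono>
#include <cmath>
#include <cstdio>

int main()
{
    const long iters = 100000000L;
    volatile double x = 1.0000001;     // reread every iteration
    volatile double sink = 0.0;        // result is stored every iteration

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = x * x * x;              // direct multiplication
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = std::pow(x, 3);         // library call
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("x*x*x   : %lld ms\n", (long long)ms(t0, t1));
    std::printf("std::pow: %lld ms\n", (long long)ms(t1, t2));
}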
Here's a somewhat DP-like algorithm that will give you the minimum number of multiplications for a given n and available powers x^i, as well as the optimal strategies via backtracking. To each possible exponent n, associate a pair (minimum number of multiplications to get here, type of multiplication that gets you there), where for the second entry you simply write i, or a special symbol S for squaring.
You obviously start at 1 -> (0, /).
Given n -> (m_n, Action_n), set n+i -> (m_n + 1, i) if m_n + 1 is smaller than a possibly previously computed minimum number of moves to n+i. Similarly, set 2n -> (m_n + 1, S) if this is better than a possible previous solution.
This algorithm gives you optimal strategies in roughly O(n_max * #available powers). I don't claim that the algorithm itself is optimally efficient, though; it certainly makes no sense to use this 'on the fly'. It's only useful if you have a reasonable n_max (100, in your case, is certainly okay) and an efficient way to store the strategies.
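Here is a compact sketch of that table-filling; the identifiers, the choice powers = {2, 4, 6}, and the backtracking printout are my own, and exponent 1 (the bare base) is taken as the free starting point, as above.

// Sketch of the DP described above: for every exponent up to n_max, store the
// minimum number of multiplications, the action that achieves it (-1 = squaring,
// otherwise the power i multiplied in), and the predecessor for backtracking.
#include <cstdio>
#include <vector>

int main()
{
    const int n_max = 50;
    const std::vector<int> powers = {2, 4, 6};  // exponents of the precalculated factors

    const int INF = 1 << 30;
    std::vector<int> cost(n_max + 1, INF);      // minimum multiplications to reach exponent
    std::vector<int> action(n_max + 1, 0);      // -1 = squaring, otherwise the added power i
    std::vector<int> from(n_max + 1, 0);        // predecessor exponent
    cost[1] = 0;                                // x^1 is the free starting point

    // Transitions only go to strictly larger exponents, so one increasing pass suffices.
    for (int n = 1; n <= n_max; ++n) {
        if (cost[n] == INF) continue;
        for (int i : powers)                    // multiply by an available power x^i
            if (n + i <= n_max && cost[n] + 1 < cost[n + i]) {
                cost[n + i] = cost[n] + 1;
                action[n + i] = i;
                from[n + i] = n;
            }
        if (2 * n <= n_max && cost[n] + 1 < cost[2 * n]) {  // square the current value
            cost[2 * n] = cost[n] + 1;
            action[2 * n] = -1;
            from[2 * n] = n;
        }
    }

    // Backtrack one optimal strategy, e.g. for exponent 14.
    for (int n = 14; n != 1; n = from[n]) {
        if (action[n] == -1)
            std::printf("x^%d = square(x^%d)\n", n, from[n]);
        else
            std::printf("x^%d = x^%d * x^%d\n", n, from[n], action[n]);
    }
    std::printf("multiplications for x^14: %d\n", cost[14]);
}

For n_max = 50 and powers = {2, 4, 6}, backtracking from exponent 14 yields a 2-multiplication strategy, consistent with the hand-derived n=14 example in the question.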
Two thoughts to consider:
(1) Until this is benchmarked, I'm not convinced it will result in a great performance improvement over standard exp by squaring (heavily dependent on the available powers, of course).
(2) The numerical error behaviour of such strategies (as well as exp by squaring) is completely different from pow(double, double).