EDIT:
Goal:
Generate a ubiquitous method for deriving a custom power function that outperforms the built-in pow(double, uint) by reusing precalculated/cached powers from power calculations on common variables.
What's already been done:
I've already derived such a function that's roughly 40% faster than the built-in; however, this is a brute-force, hand-derived function -- I want a method for auto-generating such a power function block for an arbitrary uint power.
KNOWNS
To derive an optimal custom pow(double, uint) you need some knowns. For this question the knowns (to clarify) are: N_MAX (the maximum exponent n that will ever be requested) and the precalculated powers on hand (r2, r4, and r6). r2 can be assumed to always have been calculated regardless of the other precalculated powers.
SOLUTION REQUIREMENTS
An optimal solution requiring a separate program to write a case lookup table, or preprocessor logic to generate such a table, is acceptable; however, non-optimal solutions using hand-generated (i.e. brute-force derived) lookup tables built from the powers on hand will not be accepted (as I have that already and show it in my example... the idea is to get away from this).
POSSIBLE SOLUTION ROUTE
As a suggestion: you know N_MAX and a set of precalculated powers B (B={2,4,6} for my example). You can produce, either in a separate program or in the preprocessor, a table of all repeated squarings Sq(Bi, x) <= N_MAX. You can use this to form a basis set A, which you then search somehow to determine the least number of terms that can be summed to produce an arbitrary exponent n>>1, where n<=N_MAX (the shift is because we take care of the odd case by checking the LSB and multiplying by sqrt(r2)).
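As an illustration of the first step only (building the table of repeated squarings), here is a minimal sketch; N_MAX, B, and A follow the notation above, while the container choices and the printout are my own assumptions. Exponents are kept in units of the base itself (so e.g. 6 -> 12 -> 24 -> 48); the search over A for the cheapest sum of terms is a separate problem, addressed in the answer further down.

// Sketch: build the basis set A = { b * 2^k : b in B, b * 2^k <= N_MAX },
// i.e. every exponent reachable from a precalculated power by repeated squaring alone.
#include <cstdio>
#include <set>
#include <vector>

int main()
{
    const unsigned N_MAX = 50;                 // maximum exponent needed
    const std::vector<unsigned> B = {2, 4, 6}; // exponents already on hand (r2, r4, r6)

    std::set<unsigned> A;
    for (unsigned b : B)
        for (unsigned e = b; e <= N_MAX; e *= 2) // each squaring doubles the exponent
            A.insert(e);

    for (unsigned e : A)
        std::printf("%u ", e);                 // e.g. 2 4 6 8 12 16 24 32 48
    std::printf("\n");
}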
THEORETICAL BACKGROUND
I believe formally the method below is a modified version of exponentiation by squaring:
http://en.wikipedia.org/wiki/Exponentiation_by_squaring
... which takes advantage of the fact that certain lower-order powers are by necessity already precalculated, and hence shifts the optimal set of multiplications away from vanilla exponentiation by squaring (which I assume pow(double, int) uses). However, there are significant savings from using the stored small-power intermediates instead of simple exp. by squaring on r2.
THEORETICAL PERFORMANCE
For example, for one set of objects n=14. In this scenario plain exp. by squaring gives
double r4 = Sq(r2), r14=Sq(r4)*r4*r2; //4 op.
... which takes 4 FP multiplications... but using the r2 and r6 we have
double r14=Sq(r6)*r2; //2 op.
.... 2 FP multiplications.... in other words, by going from "dumb" exponentiation by squaring to my modified exp. by squaring using the common-exponent precaching, I've cut my cost of calculation by 50% in terms of multiplications... at least until memory costs are considered.
REAL PERFORMANCE
With my current method (compiled with gcc -O3) I get 35.1 s to run 1 million cycles of my program, versus (with no other modifications) 56.6 s using the built-in pow(double, int)... so almost the theoretical speedup.
At this point you may be scratching your head at how a 50% cut in multiplications on a single instruction line can deliver a ~40% speedup. But basically this line of code is called 1,000+ times per cycle and is by far the most evaluated/most expensive line of code in the entire program. Hence the program appears highly sensitive to a small optimization/improvement in this chunk.
ORIGINAL POST and EXAMPLE CODE
I need to replace the pow(double, int) function, as I have already calculated a 6th-power term and have 2nd- and 4th-power intermediates saved, all of which can be used to reduce multiplications in the second pow call, which uses the same double base.
More specifically, in my C++ code I have a performance-critical snippet where I raise the reciprocal of the distance between 3D points to the 6th power and to an nth power, e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(sqrt(r2), n) - r6);
Where m and a are constants related to the fitted equation and n is the arbitrary power.
A slightly more efficient form is:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r6 = r2 * r2 * r2;
results += m*(pow(r2, n)*(n&0x1?sqrt(r2):1.0) - r6);
However, this is also not optimal. What I've found to be significantly faster is to have a custom pow function that uses the multiples r2, r4, and r6, which I have to calculate already anyway for the second term.
e.g.:
double distSq = CalcDist(p1,p2), r2 = a/distSq, r4 = r2 * r2, r6 = r4 * r2;
results += m*(POW(r2, r4, r6, n) - r6);
Inside the function:
double POW(double r2, double r4, double r6, uint n)
{
    // Handle an odd exponent up front, then work in powers of r2.
    double result = (n & 0x1 ? sqrt(r2) : 1.0);
    n >>= 1;
    switch (n)
    {
    case 1:
        ....
    case 12:
        result *= Sq(Sq(r6));
        break;
    }
    return result;
}
The good thing is that my function appears fast in preliminary testing. The bad news is that it's not very ubiquitous and is very long, as I need case statements for int powers from 8 to 50 or so (potentially even higher in the future). Furthermore, for each case I had to examine and try different combinations to find, by brute-force derivation, which combination of r2, r4, and r6 yielded the fewest multiplications.
Does anyone have a more ubiquitous solution for a pow(double, int) replacement that uses precalculated powers of the base to cut the number of necessary multiplications, and/or a ubiquitous theory of how to determine the ideal combination that produces the fewest multiplications for an arbitrary n and some set of precalculated multiples?
Here there is no overhead to using std::pow compared to direct multiplication (x * x). It seems that most of the overhead of this function for single precision was in fact in the conversion to double, since the algorithm itself appears to be implemented only for double precision. For the third power, it is still faster to use x * x * x than std::pow(x, 3), but std::pow is only around 2.5 times slower. I've tested several versions of G++ (4.9.4, 5.4.0 and 6.4.0) and have not seen any significant difference in performance. Note that the cost of exponentiation by squaring is tied to the Hamming weight of the exponent; it can be viewed as a suboptimal addition-chain exponentiation algorithm, since it computes the exponent by an addition chain consisting of repeated exponent doublings (squarings) and/or incrementing exponents by one (multiplying by x) only. Overall, if you do not care much about extreme accuracy, you may consider using your own pow function for small-ish (integer) n values; after about n=100 it becomes more interesting to use std::pow.
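For reference, a micro-benchmark in this spirit could look like the following sketch; the iteration count, the volatile variables, and the timing harness are my own choices, not the setup used for the numbers quoted above.

// Minimal sketch of a timing comparison between x * x * x and std::pow(x, 3).
// volatile keeps the compiler from hoisting or discarding the work.
#include <chrono>
#include <cmath>
#include <cstdio>

int main()
{
    const long iters = 100000000L;
    volatile double x = 1.0000001;     // reread every iteration
    volatile double sink = 0.0;        // result is stored every iteration

    auto t0 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = x * x * x;              // direct multiplication
    auto t1 = std::chrono::steady_clock::now();
    for (long i = 0; i < iters; ++i)
        sink = std::pow(x, 3);         // library call
    auto t2 = std::chrono::steady_clock::now();

    auto ms = [](auto a, auto b) {
        return std::chrono::duration_cast<std::chrono::milliseconds>(b - a).count();
    };
    std::printf("x*x*x   : %lld ms\n", (long long)ms(t0, t1));
    std::printf("std::pow: %lld ms\n", (long long)ms(t1, t2));
}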
Here's a somewhat DP-like algorithm that will give you the minimum number of multiplications for a given n and available powers x^i, as well as the optimal strategies via backtracking. To each possible exponent n, associate a pair (minimum number of multiplications to get here, type of multiplication that gets you there), where for the second entry you simply write i, or a special symbol S for squaring.
You obviously start at 1 -> (0, /).
Given n -> (m_n, Action_n), set n+i -> (m_n + 1, i) if m_n + 1 is smaller than a possibly previously computed minimum number of moves to n+i. Similarly, set 2n -> (m_n + 1, S) if this is better than a possible previous solution.
This algorithm gives you optimal strategies in roughly O(n_max * #available powers). I don't claim that the algorithm itself is optimally efficient, though; it certainly makes no sense to use this 'on the fly'. It's only useful if you have a reasonable n_max (100, in your case, is certainly okay) and an efficient way to store the strategies.
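Here is a compact sketch of that table-filling; the identifiers, the choice powers = {2, 4, 6}, and the backtracking printout are my own, and exponent 1 (the bare base) is taken as the free starting point, as above.

// Sketch of the DP described above: for every exponent up to n_max, store the
// minimum number of multiplications, the action that achieves it (-1 = squaring,
// otherwise the power i multiplied in), and the predecessor for backtracking.
#include <cstdio>
#include <vector>

int main()
{
    const int n_max = 50;
    const std::vector<int> powers = {2, 4, 6};  // exponents of the precalculated factors

    const int INF = 1 << 30;
    std::vector<int> cost(n_max + 1, INF);      // minimum multiplications to reach exponent
    std::vector<int> action(n_max + 1, 0);      // -1 = squaring, otherwise the added power i
    std::vector<int> from(n_max + 1, 0);        // predecessor exponent
    cost[1] = 0;                                // x^1 is the free starting point

    // Transitions only go to strictly larger exponents, so one increasing pass suffices.
    for (int n = 1; n <= n_max; ++n) {
        if (cost[n] == INF) continue;
        for (int i : powers)                    // multiply by an available power x^i
            if (n + i <= n_max && cost[n] + 1 < cost[n + i]) {
                cost[n + i] = cost[n] + 1;
                action[n + i] = i;
                from[n + i] = n;
            }
        if (2 * n <= n_max && cost[n] + 1 < cost[2 * n]) {  // square the current value
            cost[2 * n] = cost[n] + 1;
            action[2 * n] = -1;
            from[2 * n] = n;
        }
    }

    // Backtrack one optimal strategy, e.g. for exponent 14.
    for (int n = 14; n != 1; n = from[n]) {
        if (action[n] == -1)
            std::printf("x^%d = square(x^%d)\n", n, from[n]);
        else
            std::printf("x^%d = x^%d * x^%d\n", n, from[n], action[n]);
    }
    std::printf("multiplications for x^14: %d\n", cost[14]);
}

For n_max = 50 and powers = {2, 4, 6}, backtracking from exponent 14 yields a 2-multiplication strategy, consistent with the hand-derived n=14 example in the question.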
Two thoughts to consider:
(1) Until this is benchmarked, I'm not convinced it will result in a great performance improvement over standard exp by squaring (heavily dependent on the available powers, of course).
(2) The numerical error behaviour of such strategies (as well as exp by squaring) is completely different from pow(double, double).