Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why is pow(int, int) so slow?

I've been working on a few project Euler exercises to improve my knowledge of C++.

I've written the following function:

int a = 0,b = 0,c = 0;  for (a = 1; a <= SUMTOTAL; a++) {     for (b = a+1; b <= SUMTOTAL-a; b++)     {         c = SUMTOTAL-(a+b);          if (c == sqrt(pow(a,2)+pow(b,2)) && b < c)         {             std::cout << "a: " << a << " b: " << b << " c: "<< c << std::endl;             std::cout << a * b * c << std::endl;         }     } } 

This computes in 17 milliseconds.

However, if I change the line

if (c == sqrt(pow(a,2)+pow(b,2)) && b < c) 

to

if (c == sqrt((a*a)+(b*b)) && b < c) 

the computation takes place in 2 milliseconds. Is there some obvious implementation detail of pow(int, int) that I'm missing which makes the first expression compute so much slower?

like image 403
Fang Avatar asked Dec 10 '16 06:12

Fang


2 Answers

pow() works with real floating-point numbers and uses under the hood the formula

pow(x,y) = e^(y log(x)) 

to calculate x^y. The int are converted to double before calling pow. (log is the natural logarithm, e-based)

x^2 using pow() is therefore slower than x*x.

Edit based on relevant comments

  • Using pow even with integer exponents may yield incorrect results (PaulMcKenzie)
  • In addition to using a math function with double type, pow is a function call (while x*x isn't) (jtbandes)
  • Many modern compilers will in fact optimize out pow with constant integer arguments, but this should not be relied upon.
like image 200
Déjà vu Avatar answered Sep 23 '22 23:09

Déjà vu


You've picked one of the slowest possible ways to check

c*c == a*a + b*b   // assuming c is non-negative 

That compiles to three integer multiplications (one of which can be hoisted out of the loop). Even without pow(), you're still converting to double and taking a square root, which is terrible for throughput. (And also latency, but branch prediction + speculative execution on modern CPUs means that latency isn't a factor here).

Intel Haswell's SQRTSD instruction has a throughput of one per 8-14 cycles (source: Agner Fog's instruction tables), so even if your sqrt() version keeps the FP sqrt execution unit saturated, it's still about 4 times slower than what I got gcc to emit (below).


You can also optimize the loop condition to break out of the loop when the b < c part of the condition becomes false, so the compiler only has to do one version of that check.

void foo_optimized() {    for (int a = 1; a <= SUMTOTAL; a++) {     for (int b = a+1; b < SUMTOTAL-a-b; b++) {         // int c = SUMTOTAL-(a+b);   // gcc won't always transform signed-integer math, so this prevents hoisting (SUMTOTAL-a) :(         int c = (SUMTOTAL-a) - b;         // if (b >= c) break;  // just changed the loop condition instead          // the compiler can hoist a*a out of the loop for us         if (/* b < c && */ c*c == a*a + b*b) {             // Just print a newline.  std::endl also flushes, which bloats the asm             std::cout << "a: " << a << " b: " << b << " c: "<< c << '\n';             std::cout << a * b * c << '\n';         }     }   } } 

This compiles (with gcc6.2 -O3 -mtune=haswell) to code with this inner loop. See the full code on the Godbolt compiler explorer.

# a*a is hoisted out of the loop.  It's in r15d .L6:     add     ebp, 1    # b++     sub     ebx, 1    # c--     add     r12d, r14d        # ivtmp.36, ivtmp.43  # not sure what this is or why it's in the loop, would have to look again at the asm outside     cmp     ebp, ebx  # b, _39     jg      .L13    ## This is the loop-exit branch, not-taken until the end                     ## .L13 is the rest of the outer loop.                     ##  It sets up for the next entry to this inner loop. .L8:     mov     eax, ebp        # multiply a copy of the counters     mov     edx, ebx     imul    eax, ebp        # b*b     imul    edx, ebx        # c*c     add     eax, r15d       # a*a + b*b     cmp     edx, eax  # tmp137, tmp139     jne     .L6  ## Fall-through into the cout print code when we find a match  ## extremely rare, so should predict near-perfectly 

On Intel Haswell, all these instructions are 1 uop each. (And the cmp/jcc pairs macro-fuse into compare-and-branch uops.) So that's 10 fused-domain uops, which can issue at one iteration per 2.5 cycles.

Haswell runs imul r32, r32 with a throughput of one iteration per clock, so the two multiplies inside the inner loop aren't saturating port 1 at two multiplies per 2.5c. This leaves room to soak up the inevitable resource conflicts from ADD and SUB stealing port 1.

We're not even close to any other execution-port bottlenecks, so the front-end bottleneck is the only issue, and this should run at one iteration per 2.5 cycles on Intel Haswell and later.

Loop-unrolling could help here to reduce the number of uops per check. e.g. use lea ecx, [rbx+1] to compute b+1 for the next iteration, so we can imul ebx, ebx without using a MOV to make it non-destructive.


A strength-reduction is also possible: Given b*b we could try to compute (b-1) * (b-1) without an IMUL. (b-1) * (b-1) = b*b - 2*b + 1, so maybe we can do an lea ecx, [rbx*2 - 1] and then subtract that from b*b. (There are no addressing-modes that subtract instead of add. Hmm, maybe we could keep -b in a register, and count up towards zero, so we could use lea ecx, [rcx + rbx*2 - 1] to update b*b in ECX, given -b in EBX).

Unless you actually bottleneck on IMUL throughput, this might end up taking more uops and not be a win. It might be fun to see how well a compiler would do with this strength-reduction in the C++ source.


You could probably also vectorize this with SSE or AVX, checking 4 or 8 consecutive b values in parallel. Since hits are really rare, you just check if any of the 8 had a hit and then sort out which one it was in the rare case that there was a match.

See also the x86 tag wiki for more optimization stuff.

like image 40
Peter Cordes Avatar answered Sep 24 '22 23:09

Peter Cordes