Performance of pow(x,3.0f) vs x*x*x?

Question

The following program...

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <cmath>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

I'm kind of surprised that the optimizer doesn't inline the std::pow call so that the two programs produce the same code and have identical performance.

Any insights? How do you account for the roughly 7x performance difference?

For reference I'm using gcc -O3 on Linux x86
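For anyone who wants to reproduce the comparison inside a single program, here is a minimal timing sketch. The names sum_cubes_mul, sum_cubes_pow and time_ms are hypothetical helpers introduced for illustration, not part of the programs above, and the numbers it gives will depend on compiler, flags and hardware.

```cpp
#include <chrono>
#include <cmath>

// Sum of cubes using explicit multiplication (same loop body as the first program).
float sum_cubes_mul(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += x * x * x;
    }
    return t;
}

// Sum of cubes using the library power function (same loop body as the second program).
float sum_cubes_pow(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = i;
        t += std::pow(x, 3.0f);
    }
    return t;
}

// Hypothetical helper: wall-clock time of one call f(n), in milliseconds.
// The result is written to `out` so the caller can observe (e.g. print) it.
template <typename F>
double time_ms(F f, int n, float& out) {
    const auto start = std::chrono::steady_clock::now();
    out = f(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}
```

Calling time_ms(sum_cubes_mul, 1'000'000'000, result) and time_ms(sum_cubes_pow, 1'000'000'000, result) from main, and printing result afterwards, should give figures comparable to the ones quoted above under the same flags.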

Update: (C Version)

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

Update 2:

The following program:

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += __builtin_powif(x,3.0f);
    }
    return t;
}

runs in 900ms like the first program.

Why isn't pow being inlined to __builtin_powif?

Update 3:

With -ffast-math the following program:

#include <math.h>
#include <iostream>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    std::cout << t;
}

runs in 227ms (as does the x*x*x version). That's about 227 picoseconds per iteration. With -fopt-info, gcc reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled", so I guess that means it's doing iterations in batches of 4 with SSE and running 2 such batches at once through pipelining (for a total of 8 iterations at once), or something like that?
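The arithmetic behind those figures can be spelled out as a small sketch. The constant names are made up for illustration, and the lane count assumes a 4-byte float; it only shows how many floats fit in one 16-byte vector, not what the unrolling message means.

```cpp
// Figures taken from the question: 227 ms total for 1'000'000'000 iterations.
constexpr double total_ms = 227.0;
constexpr double iterations = 1'000'000'000.0;

// 1 ms = 1e9 ps, so this is the average time per iteration in picoseconds.
constexpr double ps_per_iteration = total_ms * 1e9 / iterations;

// A 16-byte vector holds 16 / sizeof(float) elements (4 if float is 4 bytes).
constexpr int vector_bytes = 16;
constexpr int floats_per_vector = vector_bytes / static_cast<int>(sizeof(float));
```

So the average cost is about 227 ps per iteration, and each 16-byte vector operation covers 4 float iterations.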

Serge Ballesta · Accepted Answer

The doc page about gcc builtins is explicit (emphasis mine):

Built-in Function: double __builtin_powi (double, int)

Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.

Built-in Function: float __builtin_powif (float, int)

Similar to __builtin_powi, except the argument and return types are float.

As __builtin_powif performs about the same as a plain product, the additional time must be spent on the work pow does to honor its guarantees about precision and rounding.
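To illustrate why an integer-exponent power can be as cheap as a product, here is a minimal sketch of exponentiation by squaring. powi_sketch is a hypothetical helper, not GCC's actual implementation of __builtin_powif; it only shows that an integer power can be reduced to a short chain of multiplications, each of which rounds its result.

```cpp
// Hypothetical helper, NOT GCC's implementation: raise x to an integer power
// using only multiplications (exponentiation by squaring).
constexpr float powi_sketch(float x, int n) {
    // Work with a non-negative exponent; widen first so negating INT_MIN is safe.
    long long e = n;
    if (e < 0) e = -e;
    float result = 1.0f;
    float base = x;
    while (e > 0) {
        if (e & 1) result *= base;  // this bit of the exponent contributes a factor
        base *= base;               // square for the next bit
        e >>= 1;
    }
    // A negative exponent is handled with a single division at the end.
    return n < 0 ? 1.0f / result : result;
}

// Exact in float, so it can be checked at compile time.
static_assert(powi_sketch(2.0f, 10) == 1024.0f, "2^10 should be 1024");
```

For n = 3 the result is formed as x * (x * x), so it costs about the same as writing x*x*x by hand, with no extra care taken over how the intermediate roundings combine.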

Tags: c++, c, gcc

Andrew Tomazos
