
Performance of pow(x,3.0f) vs x*x*x?

Tags: c++, c, gcc

The following program...

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <cmath>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += std::pow(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

I'm kind of surprised that the optimizer doesn't inline the std::pow call so that the two programs produce the same code and have identical performance.

Any insights? How do you account for the roughly 7x performance difference?

For reference I'm using gcc -O3 on Linux x86
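For anyone who wants to reproduce the comparison in a single binary, here is a minimal timing sketch. The helper names (sum_cubes_mul, sum_cubes_pow, time_ms, run_benchmark) are made up for illustration and are not part of the original programs. The iteration count is a run-time parameter and the sums are printed so the compiler cannot discard the loops.

```cpp
#include <cassert>
#include <chrono>
#include <cmath>
#include <cstdio>

// Sum of i^3 for i in [0, n) using plain multiplication.
float sum_cubes_mul(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = static_cast<float>(i);
        t += x * x * x;
    }
    return t;
}

// Same sum, but going through std::pow.
float sum_cubes_pow(int n) {
    float t = 0;
    for (int i = 0; i < n; i++) {
        const float x = static_cast<float>(i);
        t += std::pow(x, 3.0f);
    }
    return t;
}

// Runs f(n) once, stores its result, and returns the elapsed
// wall-clock time in milliseconds.
template <class F>
double time_ms(F f, int n, float& result) {
    const auto start = std::chrono::steady_clock::now();
    result = f(n);
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::milli>(stop - start).count();
}

// Hypothetical driver: times both variants and prints the sums.
void run_benchmark(int n) {
    assert(n >= 0);
    float r_mul = 0, r_pow = 0;
    const double ms_mul = time_ms(sum_cubes_mul, n, r_mul);
    const double ms_pow = time_ms(sum_cubes_pow, n, r_pow);
    std::printf("x*x*x:    %.1f ms (t = %g)\n", ms_mul, r_mul);
    std::printf("std::pow: %.1f ms (t = %g)\n", ms_pow, r_pow);
}
```

Calling run_benchmark(1'000'000'000) from main and building with g++ -O3 should give numbers comparable to the ones above, though the exact timings depend on the machine and the libm version.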

Update: (C Version)

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += x*x*x;
    }
    return t;
}

...takes about 900ms to complete on my machine. Whereas...

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += powf(x,3.0f);
    }
    return t;
}

...takes about 6600ms to complete.

Update 2

The following program:

#include <math.h>

int main() {
    float t = 0;
    for (int i = 0; i < 1000000000; i++) {
        const float x = i;
        t += __builtin_powif(x, 3);  // the second parameter is an int
    }
    return t;
}

runs in 900ms like the first program.

Why isn't pow being replaced with __builtin_powif?

Update 3:

With -ffast-math the following program:

#include <math.h>
#include <iostream>

int main() {
    float t = 0;
    for (int i = 0; i < 1'000'000'000; i++) {
        const float x = i;
        t += powf(x, 3.0f);
    }
    std::cout << t;
}

runs in 227ms (as does the x*x*x version). That's roughly 230 picoseconds per iteration.

With -fopt-info, GCC reports "optimized: loop vectorized using 16 byte vectors" and "optimized: loop with 2 iterations completely unrolled". I guess that means it's processing iterations in batches of 4 per SSE vector and overlapping 2 such batches through pipelining (8 iterations at once in total), or something like that?
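To make the "batches of 4" guess concrete, here is a scalar sketch of what a 4-lane vectorized reduction computes. It is an illustration only, not GCC's actual output, and sum_cubes_4lanes is a made-up name. Each lane keeps its own partial sum, which changes the order of the float additions. That reordering is the kind of transformation GCC only performs when reassociation is permitted, which -ffast-math enables.

```cpp
#include <cassert>

// Sum of i^3 for i in [0, n), accumulated in four independent lanes,
// mimicking a 16-byte SSE vector holding 4 floats.
float sum_cubes_4lanes(int n) {
    assert(n >= 0);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    int i = 0;

    // Main loop: one "vector" of 4 consecutive iterations per step.
    for (; n - i >= 4; i += 4) {
        for (int k = 0; k < 4; k++) {
            const float x = static_cast<float>(i + k);
            lane[k] += x * x * x;
        }
    }

    // Horizontal reduction of the four partial sums.
    float t = (lane[0] + lane[1]) + (lane[2] + lane[3]);

    // Scalar tail for the remaining 0 to 3 iterations.
    for (; i < n; i++) {
        const float x = static_cast<float>(i);
        t += x * x * x;
    }
    return t;
}
```

For small n the result matches the sequential loop exactly (for example, n = 8 gives 784). For large n the two orders of summation can round differently, which is why the compiler needs explicit permission to make this change.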

Andrew Tomazos asked Nov 25 '25 19:11


1 Answer

The GCC documentation page on built-in functions is explicit (emphasis mine, on the sentence about precision and rounding):

Built-in Function: double __builtin_powi (double, int)

Returns the first argument raised to the power of the second. Unlike the pow function no guarantees about precision and rounding are made.

Built-in Function: float __builtin_powif (float, int)

Similar to __builtin_powi, except the argument and return types are float.

Since __builtin_powif performs about as well as a plain product, the extra time must be spent on the work pow does to honor its guarantees about precision and rounding. This is consistent with Update 3: once -ffast-math relaxes those guarantees, powf(x, 3.0f) runs as fast as x*x*x.
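As a rough picture of why an integer power can be this cheap, here is a sketch of exponentiation by repeated squaring. It assumes that an integer-power routine boils down to a handful of multiplications. powi_sketch is a made-up name, and GCC's real expansion of __builtin_powif may differ.

```cpp
#include <cassert>

// Raises x to the integer power n using only multiplications
// (plus one division for negative n). No correction steps are
// applied, so rounding errors from each multiply simply accumulate.
float powi_sketch(float x, int n) {
    long long m = n;          // widen so that negating INT_MIN is safe
    if (m < 0) m = -m;

    float result = 1.0f;
    float base = x;
    while (m > 0) {
        if (m & 1) result *= base;  // this bit of the exponent is set
        base *= base;               // square for the next bit
        m >>= 1;
    }
    return n < 0 ? 1.0f / result : result;
}

// Small usage check with values that are exact in float.
void powi_sketch_self_check() {
    assert(powi_sketch(3.0f, 3) == 27.0f);
    assert(powi_sketch(2.0f, 10) == 1024.0f);
    assert(powi_sketch(2.0f, -2) == 0.25f);
}
```

For n = 3 this performs the same two multiplications as x*x*x, whereas a general pow has to handle arbitrary real exponents and still deliver a tightly rounded result.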

Serge Ballesta answered Nov 27 '25 08:11