OK, I've been talking to a friend about compilers and program optimisation, and he suggested that n * 0.5 is faster than n / 2. I said that compilers do that kind of optimisation automatically, so I wrote a small program to see if there was a difference between n / 2 and n * 0.5:
Division:
#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for (i = 0; i < m; i++) {
        n = i / 2;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;
    printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
    return 0;
}
Multiplication:
#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for (i = 0; i < m; i++) {
        n = i * 0.5;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;
    printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
    return 0;
}
For both versions I got about 0.000002 s on average when compiled with clang main.c -O1. He said there must be something wrong with the time measurement, so he wrote this program:
#include <cstdio>
#include <iostream>
#include <ctime>

using namespace std;

int main()
{
    clock_t ts, te;
    double dT;
    int i, m;
    double n, o, p, q, r, s;

    m = 1000000000;

    cout << "Independent calculations:\n";
    ts = clock();
    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = 22.2 / 2.3;
        p = 33.3 / 2.3;
        q = 44.4 / 2.3;
        r = 55.5 / 2.3;
        s = 66.6 / 2.3;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = 22.2 * 0.53;
        p = 33.3 * 0.53;
        q = 44.4 * 0.53;
        r = 55.5 * 0.53;
        s = 66.6 * 0.53;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    cout << "\nDependent calculations:\n";
    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = n / 2.3;
        p = o / 2.3;
        q = p / 2.3;
        r = q / 2.3;
        s = r / 2.3;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = n * 0.53;
        p = o * 0.53;
        q = p * 0.53;
        r = q * 0.53;
        s = r * 0.53;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    return 0;
}
And for that he got, in order:

1.869570s
1.868254s
25.674016s
3.497555s
So I ran his program on my machine, compiled with clang++ main.cpp -O1, and I got similar results to before: between 0.000002 and 0.000011 seconds.

However, when I compiled the program without optimisation, I got results similar to his. So my question is: how can any amount of optimisation make the program that much faster?
Since the code doesn't do anything different in each iteration of the loop, the compiler is free to move the code inside the loop out of it (the result will be exactly the same) and then remove the loop entirely, leaving an almost-zero run time, as you saw.
for (i = 0; i < m; i++)
{
    // make it a trivial pure float calculation with no int casting to float
    n = 11.1 * 0.53;
    o = n * 0.53;
    p = o * 0.53;
    q = p * 0.53;
    r = q * 0.53;
    s = r * 0.53;
}
is a loop that does not reference i or m and carries no dependency from one iteration to the next, so it is trivial for the compiler to remove the looping statement.
This is a good example of how benchmarking a high level language is even harder than benchmarking assembly (which is hard enough to get right already). The compiler can, and often will, interfere with your benchmark.
Your friend has a point: a division (an actual division instruction, not just writing / in C) is slower than a multiplication. For doubles, the latency ratio is about 4, and division isn't pipelined whereas multiplication is, so the throughput ratio is much worse: around 20. (These numbers are for Haswell, but they are typical.)
Integer division is slower than floating point division, but using a floating point multiplication on an integer requires two conversions. The conversions aren't too bad, so in total the floating point multiplication is still faster.
But any proper compiler will turn integer division by a constant into an integer multiplication and a shift, plus perhaps some extra fix-up (depending on the divisor and the type). Division by a power of two is even simpler. For details, see Division by Invariant Integers using Multiplication. As an example, consider
int div2(int i)
{
    return i / 2;
}
GCC turns this into
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
Depending on the µarch, this would take 3 or 4 cycles (excluding control flow).
On the other hand,
int div2(int i)
{
    return i * 0.5;
}
is turned into
cvtsi2sd xmm0, edi
mulsd xmm0, QWORD PTR .LC0[rip]
cvttsd2si eax, xmm0
ret
.LC0:
.long 0
.long 1071644672
Which would take about 4+5+4 = 13 cycles.
A compiler is also allowed to turn f / 2.0 into f * 0.5 even without any "unsafe optimizations", because division by a power of two is exactly equivalent to multiplication by its inverse. That does not hold for divisors that are not powers of two.
So even with a benchmark that at least measured something, it probably wouldn't have measured the right thing. In order to measure the latency of floating point division vs multiplication, you could write something like:
section .data
align 16
one: dq 1.0, 1.0

section .text
_bench1:
    mov ecx, -10000000
    movapd xmm0, [one]
loopone:
    mulpd xmm0, xmm0
    mulpd xmm0, xmm0
    add ecx, 1
    jnz loopone
    ret

_bench2:
    mov ecx, -10000000
    movapd xmm0, [one]
looptwo:
    divpd xmm0, xmm0
    divpd xmm0, xmm0
    add ecx, 1
    jnz looptwo
    ret
Call both a thousand times, wrapped in rdtsc to get the time. Take the lowest time for each. Multiply the time by the ratio between your base clock and turbo clock. Then you have (approximately) the number of cycles each loop takes; divide by 20000000 to get the number of cycles per mulpd or divpd. The time a division takes depends on the values being divided, so this doesn't give the most general result.
You may also be interested in a list of instruction latencies and throughputs.