OK, I've been talking to a friend about compilers and program optimisation, and he suggested that n * 0.5 is faster than n / 2. I said that compilers do that kind of optimisation automatically, so I wrote a small program to see if there was a difference between n / 2 and n * 0.5:
Division:
#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for (i = 0; i < m; i++) {
        n = i / 2;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;
    printf("n = i / 2: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
    return 0;
}
Multiplication:
#include <stdio.h>
#include <time.h>

int main(int argc, const char * argv[]) {
    int i, m;
    float n, s;
    clock_t t;

    m = 1000000000;
    t = clock();
    for (i = 0; i < m; i++) {
        n = i * 0.5;
    }
    s = (float)(clock() - t) / CLOCKS_PER_SEC;
    printf("n = i * 0.5: %d calculations took %f seconds (last calculation = %f)\n", m, s, n);
    return 0;
}
For both versions I got about 0.000002 s on average when compiled with clang main.c -O1. He said there must be something wrong with the time measurement, so he wrote this program:
#include <cstdio>
#include <iostream>
#include <ctime>

using namespace std;

int main()
{
    clock_t ts, te;
    double dT;
    int i, m;
    double n, o, p, q, r, s;

    m = 1000000000;

    cout << "Independent calculations:\n";
    ts = clock();
    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = 22.2 / 2.3;
        p = 33.3 / 2.3;
        q = 44.4 / 2.3;
        r = 55.5 / 2.3;
        s = 66.6 / 2.3;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = 22.2 * 0.53;
        p = 33.3 * 0.53;
        q = 44.4 * 0.53;
        r = 55.5 * 0.53;
        s = 66.6 * 0.53;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    cout << "\nDependent calculations:\n";
    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 / 2.3;
        o = n / 2.3;
        p = o / 2.3;
        q = p / 2.3;
        r = q / 2.3;
        s = r / 2.3;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Division: %d calculations took %f seconds\n", m, dT);

    for (i = 0; i < m; i++)
    {
        // make it a trivial pure float calculation with no int casting to float
        n = 11.1 * 0.53;
        o = n * 0.53;
        p = o * 0.53;
        q = p * 0.53;
        r = q * 0.53;
        s = r * 0.53;
    }
    te = clock();
    dT = ((float)(te - ts)) / CLOCKS_PER_SEC; // elapsed time to run the loop
    ts = clock();
    printf("Multiplication: %d calculations took %f seconds\n", m, dT);

    return 0;
}
And for that he got, in order:

1.869570s
1.868254s
25.674016s
3.497555s
So I ran his program on my machine, compiled with clang++ main.cpp -O1, and I got similar results to before: between 0.000002 and 0.000011 seconds.

However, when I compiled the program without optimisation, I got results similar to his. So my question is: how can any amount of optimisation make the program that much faster?
Since the code doesn't do anything different in each iteration of the loop, the compiler is free to move the code inside the loop out of it (the result will be exactly the same) and then remove the loop entirely, leaving an almost-zero run time, as you saw.
for (i = 0; i < m; i++)
{
    // make it a trivial pure float calculation with no int casting to float
    n = 11.1 * 0.53;
    o = n * 0.53;
    p = o * 0.53;
    q = p * 0.53;
    r = q * 0.53;
    s = r * 0.53;
}
is a loop that does not reference i or m and carries no dependency from one iteration to the next, so it is trivial for the compiler to remove the looping statement.
This is a good example of how benchmarking a high level language is even harder than benchmarking assembly (which is hard enough to get right already). The compiler can, and often will, interfere with your benchmark.
Your friend has a point: a division (an actual division instruction, not just writing / in C) is slower than a multiplication. For doubles, the latency ratio is about 4, and division isn't pipelined whereas multiplication is, so the throughput ratio is much worse: around 20. (These numbers are for Haswell, but they are typical.)
Integer division is slower than floating point division, but using a floating point multiplication on an integer requires two conversions. The conversions aren't too bad, so in total the floating point multiplication is still faster.
But any proper compiler will turn integer division by a constant into an integer multiplication and a shift, plus perhaps some extra fix-up (depending on the divisor and the type). Division by a power of two is even simpler. For details, see Division by Invariant Integers using Multiplication. As an example, consider
int div2(int i)
{
    return i / 2;
}
GCC turns this into
mov eax, edi
shr eax, 31
add eax, edi
sar eax
ret
Depending on the µarch, this would take 3 or 4 cycles (excluding control flow).
On the other hand,
int div2(int i)
{
    return i * 0.5;
}
is turned into
cvtsi2sd xmm0, edi
mulsd xmm0, QWORD PTR .LC0[rip]
cvttsd2si eax, xmm0
ret
.LC0:
.long 0
.long 1071644672
Which would take about 4+5+4 = 13 cycles.
A compiler is also allowed to turn f / 2.0 into f * 0.5 even without any "unsafe optimizations", because division by a power of two is exactly equivalent to multiplication by its inverse. That does not hold for divisors that are not powers of two.
So even with a benchmark that at least measured something, it probably wouldn't have measured the right thing. In order to measure the latency of floating point division vs multiplication, you could write something like:
section .data
align 16
one: dq 1.0, 1.0

section .text
_bench1:
    mov ecx, -10000000
    movapd xmm0, [one]
loopone:
    mulpd xmm0, xmm0
    mulpd xmm0, xmm0
    add ecx, 1
    jnz loopone
    ret

_bench2:
    mov ecx, -10000000
    movapd xmm0, [one]
looptwo:
    divpd xmm0, xmm0
    divpd xmm0, xmm0
    add ecx, 1
    jnz looptwo
    ret
Call both a thousand times, wrapped in rdtsc to get the time. Take the lowest time for each. Multiply the time by the ratio between your base clock and turbo clock. Then you have (approximately) the number of cycles each loop takes; divide by 20000000 to get the number of cycles per mulpd or divpd. The time a division takes depends on the values being divided, so this doesn't give the most general result.
You may also be interested in a list of instruction latencies and throughputs.