Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x;

    for (j = 0; j < nbr_values; j++) {
        x = 1;
        for (i = 0; i < n_iter; i++)
            x = sin(x);
    }
    printf("%f\n", x);
    return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x[nbr_values];

    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        for (j = 0; j < nbr_values; ++j) {
            x[j] = sin(x[j]);
        }
    }
    printf("%f\n", x[0]);
    return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math
and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8
Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin() cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for N calls to sin, the total time required is roughly N times the latency of a single sin evaluation (here N = 8192 × 100000 = 819,200,000).
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.