Consider the following two programs that perform the same computations in two different ways:
// v1.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x;

    for (j = 0; j < nbr_values; j++) {
        x = 1;
        for (i = 0; i < n_iter; i++)
            x = sin(x);
    }
    printf("%f\n", x);
    return 0;
}
and
// v2.c
#include <stdio.h>
#include <math.h>

int main(void) {
    int i, j;
    int nbr_values = 8192;
    int n_iter = 100000;
    float x[nbr_values];

    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        for (j = 0; j < nbr_values; ++j) {
            x[j] = sin(x[j]);
        }
    }
    printf("%f\n", x[0]);
    return 0;
}
When I compile them using gcc 4.7.2 with -O3 -ffast-math
and run on a Sandy Bridge box, the second program is twice as fast as the first one.
Why is that?
One suspect is the data dependency between successive iterations of the i loop in v1. However, I don't quite see what the full explanation might be.
(Question inspired by Why is my python/numpy example faster than pure C implementation?)
EDIT:
Here is the generated assembly for v1:
movl $8192, %ebp
pushq %rbx
LCFI1:
subq $8, %rsp
LCFI2:
.align 4
L2:
movl $100000, %ebx
movss LC0(%rip), %xmm0
jmp L5
.align 4
L3:
call _sinf
L5:
subl $1, %ebx
jne L3
subl $1, %ebp
.p2align 4,,2
jne L2
and for v2:
movl $100000, %r14d
.align 4
L8:
xorl %ebx, %ebx
.align 4
L9:
movss (%r12,%rbx), %xmm0
call _sinf
movss %xmm0, (%r12,%rbx)
addq $4, %rbx
cmpq $32768, %rbx
jne L9
subl $1, %r14d
jne L8
Ignore the loop structure altogether, and only think about the sequence of calls to sin. v1 does the following:
x <-- sin(x)
x <-- sin(x)
x <-- sin(x)
...
that is, each computation of sin() cannot begin until the result of the previous call is available; it must wait for the entirety of the previous computation. This means that for N calls to sin, the total time required is roughly N times the latency of a single sin evaluation (here N = 8192 × 100000 = 819,200,000).
In v2, by contrast, you do the following:
x[0] <-- sin(x[0])
x[1] <-- sin(x[1])
x[2] <-- sin(x[2])
...
notice that each call to sin does not depend on the previous call. Effectively, the calls to sin are all independent, and the processor can begin on each as soon as the necessary register and ALU resources are available (without waiting for the previous computation to be completed). Thus, the time required is a function of the throughput of the sin function, not the latency, and so v2 can finish in significantly less time.
I should also note that DeadMG is right that v1 and v2 are formally equivalent, and in a perfect world the compiler would optimize both of them into a single chain of 100000 sin evaluations (or simply evaluate the result at compile time). Sadly, we live in an imperfect world.