
How can a single sqrt() run twice as slow as when it is put in a for loop?

I'm running an experiment to profile the time it takes to compute a single sqrt() in C code. I have two strategies.

One is a direct measurement of a single sqrt() call; the other executes sqrt() multiple times in a for loop and then computes the average. The C code is very simple and is shown below:

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

long long readTSC(void);

int main(int argc, char** argv)
{
    int    n = atoi(argv[1]);
    //v is the input of sqrt(), making sure the compiler won't
    //precompute the result of sqrt(v) as it would for a constant
    double v = atof(argv[2]);
    long long tm;            //track CPU clock cycles
    double x;                //result of sqrt()
    //-- strategy I ---
    tm = readTSC();     //uses the rdtsc instruction to read the Intel CPU's clock-cycle counter
    x  = sqrt(v);
    tm = readTSC() - tm;
    printf("x=%15.6\n",x);   //make sure compiler won't optimize out the above sqrt()
    printf("%lld clocks\n",tm);
    double sum = 0.0;
    int i;

    //-- strategy II --
    tm = readTSC();
    for ( i = 0; i < n; i++ )
        sum += sqrt((double) i);
    tm = readTSC() - tm;

    printf("%lld clocks\n",tm);
    printf("%15.6e\n",sum);
    return 0;
}

long long readTSC(void)
{
 /* read the time stamp counter on Intel x86 chips */
  union { long long complete; unsigned int part[2]; } ticks;
  __asm__ ("rdtsc; mov %%eax,%0;mov %%edx,%1"
        : "=mr" (ticks.part[0]),
          "=mr" (ticks.part[1])
        : /* no inputs */
        : "eax", "edx");
  return ticks.complete;
}

Before running the code, I expected the timing result of strategy I to be slightly smaller than that of strategy II, because strategy II also counts the overhead of the for loop and the sum addition.

I compile the code without -O3 optimization on an Intel Xeon E5-2680 2.7 GHz machine using the following command:

gcc -o timing timing.c -lm

However, the result shows that strategy I takes about 40 clock cycles, while strategy II averages 21.8 clock cycles, almost half of the former.

For reference, I have also pasted the relevant assembly code below with some comments. Based on the timing result, it seems as if each for-loop iteration executes two sqrt() calls, but I can hardly tell from the assembly how the CPU could actually execute two sqrt() calls in parallel.

call    atof
cvtsi2ss    %eax, %xmm0
movss   %xmm0, -36(%rbp)

//-- timing single sqrt ---
call    readTSC
movq    %rax, -32(%rbp)
movss   -36(%rbp), %xmm1
cvtps2pd    %xmm1, %xmm1
//--- sqrtsd instruction
sqrtsd  %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp  .L8
je  .L4
.L8:
movapd  %xmm1, %xmm0
//--- C function call sqrt()
call    sqrt
.L4:
movsd   %xmm0, -72(%rbp)
movq    -72(%rbp), %rax
movq    %rax, -24(%rbp)
call    readTSC
//-- end of timing single sqrt ---

subq    -32(%rbp), %rax
movq    %rax, -32(%rbp)
movl    $.LC0, %eax
movsd   -24(%rbp), %xmm0
movq    %rax, %rdi
movl    $1, %eax
call    printf
movl    $.LC1, %eax
movq    -32(%rbp), %rdx
movq    %rdx, %rsi
movq    %rax, %rdi
movl    $0, %eax
call    printf
movl    $0, %eax
movq    %rax, -16(%rbp)

call    readTSC
//-- start of for loop----
movq    %rax, -32(%rbp)
movl    $0, -4(%rbp)
jmp .L5
.L6:
//(double) i
cvtsi2sd    -4(%rbp), %xmm0
//-- C function call sqrt()
call    sqrt
movsd   -16(%rbp), %xmm1
//add sqrt(i) to sum (%xmm0)
addsd   %xmm1, %xmm0
movsd   %xmm0, -16(%rbp)
//i++
addl    $1, -4(%rbp)
.L5:
movl    -4(%rbp), %eax
//check i<n
cmpl    -40(%rbp), %eax
jl  .L6
//-- end of for loop--
//you can skip the rest.
call    readTSC
subq    -32(%rbp), %rax
movq    %rax, -32(%rbp)
movl    $.LC1, %eax
movq    -32(%rbp), %rdx
movq    %rdx, %rsi
movq    %rax, %rdi
movl    $0, %eax
call    printf
movl    $.LC3, %eax
movsd   -16(%rbp), %xmm0
movq    %rax, %rdi
movl    $1, %eax
call    printf
asked by Stone

2 Answers

The E5-2680 is a Sandy Bridge CPU, and both the latency and the reciprocal throughput of SQRTSD on that microarchitecture are 10 to 21 cycles per instruction. So, in a loop or not, you should measure something close to the observed 21.8 cycles. The sqrt function in GLIBC simply checks the sign of the argument and arranges, via branch prediction, for the non-negative branch to be executed speculatively; that branch calls __ieee754_sqrt, which itself is a simple inline assembly routine that on x86-64 systems boils down to sqrtsd %xmm0, %xmm0.
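
As a side note, the ucomisd/jp fallback to the library sqrt in the question's assembly exists only so that errno can be set for negative inputs. A minimal sketch, assuming GCC and glibc, of how to get just the inline sqrtsd:

#include <math.h>

/* Compile with, for example:
 *   gcc -O2 -fno-math-errno -S sqrt_inline.c
 * With -fno-math-errno GCC typically emits a bare sqrtsd here instead of
 * the sqrtsd + conditional call to the glibc sqrt wrapper seen in the
 * question's assembly (exact output varies with the GCC version). */
double sqrt_inline(double v)
{
    return sqrt(v);
}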

The CPU uses register renaming to cope with the data dependency on %xmm0, so it can have two copies of sqrtsd %xmm0, %xmm0 at different stages of execution in the pipeline. Since the result of each sqrt is not needed immediately, other instructions can execute while a sqrt is still in flight, and that is why you measure only 21.8 cycles on average.
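
You can see the effect by making each iteration depend on the previous result, which prevents the square roots from overlapping. A rough sketch along those lines, reusing the readTSC() from the question:

#include <math.h>
#include <stdio.h>

long long readTSC(void);   /* the questioner's TSC reader */

/* The first loop issues independent square roots, so several can be in
 * flight at once and the per-iteration cost approaches the reciprocal
 * throughput.  The second loop feeds each result back into the next
 * sqrt, so every iteration has to wait out the full latency. */
double compare_latency_throughput(int n)
{
    long long t;
    double sum = 0.0, chain = 1.0;
    int i;

    t = readTSC();
    for (i = 0; i < n; i++)
        sum += sqrt((double) i);      /* independent: throughput-bound */
    t = readTSC() - t;
    printf("independent: %.1f clocks/iter\n", (double) t / n);

    t = readTSC();
    for (i = 0; i < n; i++)
        chain = sqrt(chain + 1.0);    /* dependent: latency-bound */
    t = readTSC() - t;
    printf("dependent:   %.1f clocks/iter\n", (double) t / n);

    return sum + chain;               /* keep both results live */
}

The dependent loop should report a noticeably larger per-iteration count, roughly the full sqrtsd latency plus the dependent addition.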

As for the larger value in the first case, RDTSC does not have single-cycle resolution. It has a certain latency of its own, so you are basically measuring T_code_block + T_rdtsc_latency. In the second strategy, averaging over the iterations gives:

(T_code_block * n_iters + T_rdtsc_latency) / n_iters =
= T_code_block + (T_rdtsc_latency / n_iters)

For large n_iters, the second term vanishes and you get a very accurate measurement of a single iteration.
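
The RDTSC overhead itself can be estimated by timing an empty region and subtracting it. A small sketch of that correction, again assuming the questioner's readTSC():

long long readTSC(void);   /* the questioner's TSC reader */

/* Time back-to-back readTSC() calls with nothing in between; the minimum
 * over many tries approximates T_rdtsc_latency and can be subtracted from
 * a single-shot measurement to remove most of the constant bias. */
long long rdtsc_overhead(void)
{
    long long best = -1;
    int i;
    for (i = 0; i < 1000; i++) {
        long long t = readTSC();
        t = readTSC() - t;
        if (best < 0 || t < best)
            best = t;
    }
    return best;
}

Even with that correction, for regions of only a few instructions the averaged loop remains the more trustworthy number.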


One has to be very careful when benchmarking with RDTSC. On modern CPUs the TSC ticks at the reference clock frequency. If the loop runs long enough, it can trigger the core's boost mode and the CPU will run faster, so one core clock cycle corresponds to less than one reference clock tick. As a result, instructions executed in boosted regions appear to take fewer cycles than instructions executed at the nominal clock frequency.

Also, when performing cycle-accurate measurements, always pin the process to a single CPU core, either with the taskset utility or with the sched_setaffinity(2) syscall. The OS scheduler tends to move processes between cores to keep them equally loaded, and a migration is expensive. The probability of it happening during a small region of a few instructions is very low, but for long loops it is much higher. Averaging over many iterations reduces the impact of a migration, but the results would still be skewed; pinning the process to a single core prevents this altogether.
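
A minimal pinning sketch using sched_setaffinity(2), which is Linux-specific (from the shell, taskset -c 0 ./timing <n> <v> achieves the same thing):

#define _GNU_SOURCE              /* needed for CPU_SET()/sched_setaffinity() */
#include <sched.h>
#include <stdio.h>

/* Pin the calling process to one core before the timed region so the
 * scheduler cannot migrate it mid-measurement. */
static int pin_to_core(int core)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}

Calling something like pin_to_core(0) at the top of main(), before the first readTSC(), is enough for this kind of experiment.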

answered by Hristo Iliev

It looks to me like Strategy I uses the sqrtsd instruction: the ucomisd instruction does not set the parity flag, so the code jumps directly to .L4.

The for loop (Strategy II) uses call sqrt to compute the square root. That could be an optimized version of sqrt, for example one based on approximation and therefore faster than sqrtsd; some compilers and libraries perform this kind of optimization.

Even if call sqrt ends up executing the sqrtsd instruction under the hood, there is no reason why it should run faster outside of the loop.

Please note that measuring the latency of a single instruction executed only once is not deterministic. The rdtsc instruction has a latency of its own, and because today's CPUs are superscalar and out-of-order, you cannot be sure that rdtsc, sqrtsd, rdtsc execute entirely in program order. They certainly do not execute on the same port, so the sqrtsd is not guaranteed to be complete by the time the second rdtsc completes.
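
One common way to reduce that uncertainty, shown here only as a sketch, is to fence the counter reads, e.g. lfence before rdtsc (or rdtscp for the closing read), so the counter is not sampled before the earlier instructions have finished:

/* Sketch of a fenced TSC read for x86-64 with GCC/Clang inline asm.
 * On Intel CPUs lfence does not execute until all preceding instructions
 * have completed locally, so rdtsc cannot sample the counter early.
 * Intel's benchmarking guidance uses a similar cpuid/rdtsc ... rdtscp
 * pattern. */
static inline long long readTSC_fenced(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("lfence\n\t"
                          "rdtsc"
                          : "=a" (lo), "=d" (hi)
                          :
                          : "memory");
    return ((long long) hi << 32) | lo;
}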

Another important thing to consider is that executing sqrt in the loop decreases the measured average latency per instruction, because multiple instructions can be executing in parallel in different stages of the pipeline. Agner Fog's instruction tables show that on Ivy Bridge sqrtsd can achieve a throughput of 1/8 to 1/14 instructions per cycle (i.e. 8 to 14 cycles per instruction). The for loop benefits from that throughput, while the single isolated instruction pays the full latency.

So, because of pipelined, parallel execution, the instructions in the loop run faster on average than when executed in isolation.

answered by VAndrei