I'm running an experiment to profile the time it takes to compute a single sqrt in C. I have two strategies: one is to measure a single sqrt call directly, the other is to execute sqrt many times in a for loop and compute the average. The C code is very simple and is shown below:
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

long long readTSC(void);

int main(int argc, char** argv)
{
    int n = atoi(argv[1]);
    //v is the input of sqrt(), making sure the compiler won't
    //precompute the result of sqrt(v) as it would for a constant
    double v = atof(argv[2]);
    long long tm; //track CPU clock cycles
    double x;     //result of sqrt()
    //-- strategy I ---
    tm = readTSC(); //a function that uses the rdtsc instruction to read the Intel CPU's clock-cycle counter
    x = sqrt(v);
    tm = readTSC() - tm;
    printf("x=%15.6e\n", x); //make sure the compiler won't optimize out the sqrt() above
    printf("%lld clocks\n", tm);
    double sum = 0.0;
    int i;
    //-- strategy II --
    tm = readTSC();
    for ( i = 0; i < n; i++ )
        sum += sqrt((double) i);
    tm = readTSC() - tm;
    printf("%lld clocks\n", tm);
    printf("%15.6e\n", sum);
    return 0;
}
long long readTSC(void)
{
    /* read the time stamp counter on Intel x86 chips */
    union { long long complete; unsigned int part[2]; } ticks;
    __asm__ ("rdtsc; mov %%eax,%0; mov %%edx,%1"
             : "=mr" (ticks.part[0]),
               "=mr" (ticks.part[1])
             : /* no inputs */
             : "eax", "edx");
    return ticks.complete;
}
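As an aside, on GCC and Clang the same counter can also be read with the __rdtsc() intrinsic from <x86intrin.h>, which avoids hand-written inline assembly; a minimal equivalent sketch (an assumption about the toolchain, not used in the measurements below):

#include <x86intrin.h>   /* provides __rdtsc() on GCC/Clang for x86 targets */

long long readTSC_intrinsic(void)
{
    /* __rdtsc() executes the same RDTSC instruction and returns
       the full 64-bit time stamp counter value */
    return (long long) __rdtsc();
}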
Before running the code, I expected that the timing result of strategy I might be slightly smaller than the one from strategy II, because strategy II also counts the overhead incurred by the for loop and the sum addition.
I compile the code with the following command, without -O3 optimization, on a 2.7 GHz Intel Xeon E5-2680 machine:
gcc -o timing -lm timing.c
However, the results show that strategy I takes about 40 clock cycles, while strategy II takes an average of 21.8 clock cycles, almost half of the former.
For your reference, I have also pasted the related assembly code below with some comments. Based on the timing results, it seems that each for-loop iteration executes two sqrt()'s. But from the assembly code I can hardly tell how the CPU could actually be executing two sqrt() calls in parallel.
call atof
cvtsi2ss %eax, %xmm0
movss %xmm0, -36(%rbp)
//-- timing single sqrt ---
call readTSC
movq %rax, -32(%rbp)
movss -36(%rbp), %xmm1
cvtps2pd %xmm1, %xmm1
//--- sqrtsd instruction
sqrtsd %xmm1, %xmm0
ucomisd %xmm0, %xmm0
jp .L8
je .L4
.L8:
movapd %xmm1, %xmm0
//--- C function call sqrt()
call sqrt
.L4:
movsd %xmm0, -72(%rbp)
movq -72(%rbp), %rax
movq %rax, -24(%rbp)
call readTSC
//-- end of timing single sqrt ---
subq -32(%rbp), %rax
movq %rax, -32(%rbp)
movl $.LC0, %eax
movsd -24(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
movl $.LC1, %eax
movq -32(%rbp), %rdx
movq %rdx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
movl $0, %eax
movq %rax, -16(%rbp)
call readTSC
//-- start of for loop----
movq %rax, -32(%rbp)
movl $0, -4(%rbp)
jmp .L5
.L6:
//(double) i
cvtsi2sd -4(%rbp), %xmm0
//-- C function call sqrt()
call sqrt
movsd -16(%rbp), %xmm1
//add sqrt(i) to sum (%xmm0)
addsd %xmm1, %xmm0
movsd %xmm0, -16(%rbp)
//i++
addl $1, -4(%rbp)
.L5:
movl -4(%rbp), %eax
//check i<n
cmpl -40(%rbp), %eax
jl .L6
//-- end of for loop--
//you can skip the rest of the part.
call readTSC
subq -32(%rbp), %rax
movq %rax, -32(%rbp)
movl $.LC1, %eax
movq -32(%rbp), %rdx
movq %rdx, %rsi
movq %rax, %rdi
movl $0, %eax
call printf
movl $.LC3, %eax
movsd -16(%rbp), %xmm0
movq %rax, %rdi
movl $1, %eax
call printf
The E5-2680 is a Sandy Bridge CPU, and both the latency and the reciprocal throughput of SQRTSD are 10 to 21 cycles/instruction. So, loop or no loop, you should measure something close to the observed 21.8 cycles. The sqrt function in GLIBC simply checks the sign of the argument and arranges for the non-negative branch, which in turn is a call to __ieee754_sqrt, to be executed speculatively via branch prediction; __ieee754_sqrt itself is a simple inline assembly routine that on x86-64 systems emits sqrtsd %xmm0, %xmm0.
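Illustratively, such a wrapper boils down to a single instruction on the fast path; the following is a sketch of the idea only, not glibc's actual source:

/* sketch: the non-negative path of sqrt() is essentially one sqrtsd */
static double sqrt_sketch(double x)
{
    double r;
    __asm__ ("sqrtsd %1, %0" : "=x" (r) : "x" (x));
    return r;
}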
The CPU uses register renaming to handle the data dependency, therefore it can have two copies of sqrtsd %xmm0, %xmm0 at different stages of execution in the pipeline. Since the result of sqrt is not needed immediately, other instructions can be executed while sqrt is being processed, and that is why you measure only 21.8 cycles on average.
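The effect of that overlap can be made visible by comparing the question's loop with a variant in which each sqrt depends on the previous result, so the calls cannot overlap. A minimal sketch, written as a drop-in addition to main() and reusing readTSC(), n, and v from the question (the dependent loop is an illustrative addition, not part of the original code):

/* independent operations: sqrt((double)k) does not depend on the previous
   iteration, so several sqrts can be in flight in the pipeline at once */
double indep = 0.0, chain = v;
long long t;
int k;
t = readTSC();
for (k = 0; k < n; k++)
    indep += sqrt((double) k);
t = readTSC() - t;
printf("independent: %lld clocks, sum=%15.6e\n", t, indep);

/* dependent chain: each sqrt needs the previous result, so the operations
   cannot overlap and the per-iteration average should rise toward the
   full sqrtsd latency */
t = readTSC();
for (k = 0; k < n; k++)
    chain = sqrt(chain + 1.0);
t = readTSC() - t;
printf("dependent:   %lld clocks, chain=%15.6e\n", t, chain);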
As for the larger value in the first case, RDTSC does not have the resolution of a single cycle. It has a certain latency, so you are basically measuring T_code_block + T_rdtsc_latency. In the second scenario, averaging over the iterations gives:

(T_code_block * n_iters + T_rdtsc_latency) / n_iters = T_code_block + T_rdtsc_latency / n_iters

For large n_iters, the second term vanishes and you get a very accurate measurement of a single iteration.
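The T_rdtsc_latency term can be estimated directly by timing two back-to-back readTSC() calls with nothing in between; a minimal sketch, again as a drop-in fragment using the readTSC() from the question (the value obtained will vary from CPU to CPU):

/* estimate the measurement overhead itself: two consecutive TSC reads
   around an empty "code block" */
long long t0 = readTSC();
long long t1 = readTSC();
printf("empty block: %lld clocks (approximately T_rdtsc_latency)\n", t1 - t0);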
One has to be very careful when benchmarking with RDTSC. The TSC itself ticks on modern CPUs at the reference clock speed. If the loop runs for long enough, it could trigger the core clock boost mode and the CPU will run faster, so one core clock cycle will correspond to less than one reference clock cycle. As a result, instructions executed in boosted regions will appear to take fewer cycles than instructions executed in regions running at the nominal clock frequency.
Also, when performing cycle-accurate measurements, always pin the process to a single CPU core, either with the taskset utility or with the sched_setaffinity(2) syscall. The OS scheduler usually moves processes around the different cores in order to keep them equally loaded, and a migration is an expensive event. The probability that this happens during the execution of a small region of several instructions is very low, but for long loops it is much higher. Averaging over many iterations can reduce the severity of a migration, but the results would still be skewed. Pinning the process to a single core prevents this altogether.
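For example, the program can pin itself to core 0 at the start of main(); a minimal sketch using the glibc wrapper for sched_setaffinity(2) (the choice of core 0 is arbitrary):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

static void pin_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);                                  /* restrict the process to core 0 */
    if (sched_setaffinity(0, sizeof(set), &set) != 0)  /* pid 0 = calling process */
        perror("sched_setaffinity");
}

The same effect can be obtained from outside the program with something like taskset -c 0 ./timing <n> <v>.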
It looks to me that Strategy I uses the sqrtsd instruction. This is because the ucomisd comparison does not set the parity flag (the result is not a NaN), so the code jumps directly to .L4 without reaching the library call.
The for loop, Strategy II, uses call sqrt to compute the square root. This may be an optimized version of sqrt, achieved through approximation, and thus faster than the sqrtsd instruction. Some compilers might do this optimization.
Even if call sqrt uses the sqrtsd instruction under the hood, there is no reason why it should run faster outside of the loop.
Please take note that measuring the latency of a single instruction executed only once is not deterministic. The rdtsc instruction has a latency of its own, and because today's CPUs are superscalar and out-of-order, you cannot be sure that rdtsc; sqrtsd; rdtsc execute completely in program order. They certainly do not execute on the same port, so the sqrtsd is not guaranteed to be complete by the time the second rdtsc completes.
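One way to reduce this effect is to put a fencing instruction in front of rdtsc, so that all earlier instructions have finished before the counter is read. A minimal sketch of a fenced variant of the question's readTSC(), assuming Intel x86-64 semantics where lfence waits for preceding instructions to complete:

/* fenced TSC read: lfence makes prior instructions finish before rdtsc runs */
static long long readTSC_fenced(void)
{
    unsigned int lo, hi;
    __asm__ __volatile__ ("lfence\n\t"
                          "rdtsc"
                          : "=a" (lo), "=d" (hi)
                          :
                          : "memory");
    return ((long long) hi << 32) | lo;
}

Calling this before and after the single sqrt gives a reading that is less distorted by out-of-order execution, although the rdtsc overhead itself is still included in the measurement.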
Another important thing to consider is that when you execute sqrt in the loop, you decrease the average latency of the instructions. That is because multiple instructions can be executing in parallel in different stages of the pipeline. Agner Fog's instruction tables show that Ivy Bridge's sqrtsd instruction has a throughput ranging from 1/8 to 1/14 instructions per cycle. The for loop increases the average throughput, while the single isolated instruction pays its full latency.
So, because of the pipelined, parallel execution, the instructions in the loop run faster on average than when running in isolation.
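As a rough illustration, using round numbers in line with those tables rather than measured values: if sqrtsd has a latency of about 21 cycles but a new sqrtsd can start roughly every 8 to 14 cycles, then n independent square roots finish in roughly 21 + 14*(n - 1) cycles at worst, so the per-operation average approaches the reciprocal throughput for large n, while a single isolated sqrtsd pays the full latency plus the rdtsc measurement overhead.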