Variance in RDTSC overhead

Question

I'm constructing a micro-benchmark to measure performance changes as I experiment with the use of SIMD instruction intrinsics in some primitive image processing operations. However, writing useful micro-benchmarks is difficult, so I'd like to first understand (and if possible eliminate) as many sources of variation and error as possible.

One factor that I have to account for is the overhead of the measurement code itself. I'm measuring with RDTSC, and I'm using the following code to find the measurement overhead:

extern inline unsigned long long __attribute__((always_inline)) rdtsc64() {
    unsigned int hi, lo;
    __asm__ __volatile__(
        "xorl %%eax, %%eax\n\t"
        "cpuid\n\t"
        "rdtsc"
        : "=a"(lo), "=d"(hi)
        : /* no inputs */
        : "rbx", "rcx");
    return ((unsigned long long)hi << 32ull) | (unsigned long long)lo;
}

unsigned int find_rdtsc_overhead() {
    const int trials = 1000000;

    std::vector<unsigned long long> times;
    times.resize(trials, 0);

    for (int i = 0; i < trials; ++i) {
        unsigned long long t_begin = rdtsc64();
        unsigned long long t_end = rdtsc64();
        times[i] = (t_end - t_begin);
    }

    // print frequencies of cycle counts
}

When running this code, I get output like this:

Frequency of occurrence (for 1000000 trials):
234 cycles (counted 28 times)
243 cycles (counted 875703 times)
252 cycles (counted 124194 times)
261 cycles (counted 37 times)
270 cycles (counted 2 times)
693 cycles (counted 1 times)
1611 cycles (counted 1 times)
1665 cycles (counted 1 times)
... (a bunch of larger times each only seen once)
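
For reference, the "print frequencies of cycle counts" step elided in the code above could look roughly like the sketch below. This is an illustration rather than the exact code that produced the output; count_frequencies, most_common and print_frequencies are hypothetical helper names, and the output format simply mimics the listing above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <map>
#include <vector>

typedef std::map<unsigned long long, std::size_t> FreqMap;

// Tally how often each cycle count occurs. std::map keeps its keys sorted,
// so iterating over the result visits cycle counts in ascending order.
FreqMap count_frequencies(const std::vector<unsigned long long>& times) {
    FreqMap freq;
    for (std::size_t i = 0; i < times.size(); ++i) {
        ++freq[times[i]];
    }
    return freq;
}

// Return the most frequently observed cycle count (the mode).
// On a tie, the smallest cycle count wins because keys are visited in order.
unsigned long long most_common(const FreqMap& freq) {
    assert(!freq.empty());
    FreqMap::const_iterator best = freq.begin();
    for (FreqMap::const_iterator it = freq.begin(); it != freq.end(); ++it) {
        if (it->second > best->second) {
            best = it;
        }
    }
    return best->first;
}

// Print one line per distinct cycle count, in ascending order.
void print_frequencies(const FreqMap& freq, std::size_t trials) {
    std::printf("Frequency of occurrence (for %lu trials):\n",
                (unsigned long)trials);
    for (FreqMap::const_iterator it = freq.begin(); it != freq.end(); ++it) {
        std::printf("%llu cycles (counted %lu times)\n",
                    it->first, (unsigned long)it->second);
    }
}
```

Calling print_frequencies(count_frequencies(times), times.size()) at the end of find_rdtsc_overhead() would give a table in this shape, and most_common() gives one candidate value to return as the overhead estimate.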

My questions are these:

What are the possible causes of the bimodal distribution of cycle counts generated by the code above?
Why does the fastest time (234 cycles) occur only a handful of times? What highly unusual circumstance could reduce the count?

Further Information

Platform:

Linux 2.6.32 (Ubuntu 10.04)
g++ 4.4.3
Core 2 Duo (E6600); this has a constant-rate TSC.

SpeedStep has been turned off (processor is set to performance mode and is running at 2.4GHz); if running in 'ondemand' mode, I get two peaks at 243 and 252 cycles, and two (presumably corresponding) peaks at 360 and 369 cycles.

I'm using sched_setaffinity to lock the process to one core. If I run the test on each core in turn (i.e., lock to core 0 and run, then lock to core 1 and run), I get similar results for the two cores, except that the fastest time of 234 cycles tends to occur slightly fewer times on core 1 than on core 0.
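
The pinning step is roughly as sketched below. This is a minimal illustration, assuming Linux with glibc; pin_to_core, first_allowed_core and pin_for_benchmark are hypothetical helper names, not part of the benchmark shown above.

```cpp
#include <cassert>
#include <sched.h>   // sched_setaffinity, sched_getaffinity, cpu_set_t (Linux/glibc)

// Restrict the calling thread to a single core. Returns true on success.
bool pin_to_core(int core) {
    if (core < 0 || core >= CPU_SETSIZE) {
        return false;
    }
    cpu_set_t mask;
    CPU_ZERO(&mask);
    CPU_SET(core, &mask);
    // A pid of 0 means "the calling thread".
    return sched_setaffinity(0, sizeof(mask), &mask) == 0;
}

// Return the lowest-numbered core the thread is currently allowed to run on,
// or -1 if the affinity mask cannot be read.
int first_allowed_core() {
    cpu_set_t mask;
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) != 0) {
        return -1;
    }
    for (int c = 0; c < CPU_SETSIZE; ++c) {
        if (CPU_ISSET(c, &mask)) {
            return c;
        }
    }
    return -1;
}

// Example use at the start of a benchmark run: pick a core and pin to it.
void pin_for_benchmark() {
    int core = first_allowed_core();
    assert(core >= 0);
    bool ok = pin_to_core(core);
    assert(ok);
    (void)ok;  // silence the unused-variable warning when NDEBUG is defined
}
```

To repeat the per-core comparison described above, call pin_to_core(0) for one run and pin_to_core(1) for the next, before entering the timing loop.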

Compile command is:

g++ -Wall -mssse3 -mtune=core2 -O3 -o test.bin test.cpp

The code that GCC generates for the core loop is:

.L105:
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    movl    %edx, %ebp
    movl    %eax, %edi
#APP
# 27 "test.cpp" 1
    xorl %eax, %eax
    cpuid
    rdtsc
# 0 "" 2
#NO_APP
    salq    $32, %rdx
    salq    $32, %rbp
    mov %eax, %eax
    mov %edi, %edi
    orq %rax, %rdx
    orq %rdi, %rbp
    subq    %rbp, %rdx
    movq    %rdx, (%r8,%rsi)
    addq    $8, %rsi
    cmpq    $8000000, %rsi
    jne .L105
