Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Comparing the time measured results produced by rdtsc clock and c++11 std::chrono::high_resolution_clock

I am trying to compare the times measured by c++11 std::chrono::high_resolution_clock and the rdtsc_clock clock at below. From the high_resolution_clock, I am getting result like 11000, 3000, 1000, 0. From the rdtsc_clock, I am getting 134, 15, 91 etc. Why their result look so differently? From my gut feeling, I believe the rdtsc_clock is presenting the ~accurate results, am I right?

    template<std::intmax_t clock_freq>
    struct rdtsc_clock {
        typedef unsigned long long rep;
        typedef std::ratio<1, clock_freq> period;
        typedef std::chrono::duration<rep, period> duration;
        typedef std::chrono::time_point<rdtsc_clock> time_point;
        static const bool is_steady = true;

        static time_point now() noexcept
        {

            unsigned lo, hi;
            asm volatile("rdtsc" : "=a" (lo), "=d" (hi));

            return time_point(duration(static_cast<rep>(hi) << 32 | lo));
        }
    };

The timing code:

typedef std::chrono::high_resolution_clock Clock;
//typedef rdtsc_clock<3300000000> Clock;
typedef std::chrono::nanoseconds nanoseconds;
typedef std::chrono::duration<double, typename Clock::period> Cycle;
for(int n=0; n < 10; n++){
   auto t1 = Clock::now();
   //codes
   auto t2 = Clock::now();
   printf(%.0f ns \n", duration_cast<nanoseconds>(Cycle(t2 - t1)).count());
}
like image 428
Bryan Fok Avatar asked Nov 01 '13 01:11

Bryan Fok


2 Answers

Issues with RDTSC usage

If you read some online docs on RDTSC, you'll see that it doesn't ensure instructions from after the RDTSC instruction aren't executed in the pipeline before the RDTSC instruction itself runs (nor that earlier instructions don't run afterwards). The normal advice is to use a CPUID instruction immediately before and/or after the RDTSC to trigger such "sequence points". Obviously this impacts program performance, and is more desirable for some kinds of measurements than others (where average throughput figures are of more interest than individual samples). You can expect that the Standard library implementation's being much more careful about this, which may be one reason its measurements are far higher.

Cross-Core Issues (not relevant per your comment)

Each CPU core maintains its own TSC register... if you just start taking samples on a thread that's not bound to a core, or on multiple threads not bound to the same core, you may see "weird" jumps in values. Some companies (e.g. Microsoft) insist that the Hardware Abstraction Laye (HAL) is responsible for trying to get the registers as close to in-sync as possible, but many (even brand new high end) PCs simply fail to do this.

You can get around this by binding to a core, or by doing some calibration step that measures the cross-core deltas (with some calibration error margin), then adjust later samples based on the core from which they're sampled (which itself is painful to determine on most CPUs - you'll need to spin taking samples between CPUID instructions or something similar).

like image 102
Tony Delroy Avatar answered Sep 20 '22 06:09

Tony Delroy


I think you are not comparing the same thing, on my mac this example work, rdtsc and std::chrono give the same number of cycle, if it can help:

#include <iostream>
#include <vector>
#include <numeric>
#include <chrono>

static __inline__ unsigned long long rdtsc(void){
    unsigned hi, lo;  
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32   );
}

static int sink = 0;

int main(){
    typedef std::ratio<1, 2400000000> period; // My mac @ 2.4 GHz
    unsigned long long int a,b;
    for (auto size = 1ull; size < 1000000000ull; size *= 100) {
        // record start time
        auto start = std::chrono::high_resolution_clock::now();
        a = rdtsc();
        // do some work
        std::vector<int> v(size, 42);
        sink = std::accumulate(v.begin(), v.end(), 0u); // make sure it's a side effect
        // record end time
        b = rdtsc();
        auto end = std::chrono::high_resolution_clock::now();
        std::chrono::duration<double, period> diff = end-start;
        std::cout << "Time to fill and iterate a vector of "
                  << size << " ints : " << diff.count() << " [cycle]"
                  << ", old style: "<< b-a << " [cycle] \n";
   }
}


Time to fill and iterate a vector of 1 ints : 13965.6 [cycle], old style: 13731 [cycle] 
Time to fill and iterate a vector of 100 ints : 1065.6 [cycle], old style: 969 [cycle] 
Time to fill and iterate a vector of 10000 ints : 68076 [cycle], old style: 67899 [cycle] 
Time to fill and iterate a vector of 1000000 ints : 5.4853e+06 [cycle], old style: 5483487 [cycle] 
Time to fill and iterate a vector of 100000000 ints : 6.57399e+08 [cycle], old style: 657395277 [cycle] 
like image 43
Timocafé Avatar answered Sep 22 '22 06:09

Timocafé