Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to count clock cycles with RDTSC in GCC x86? [duplicate]

With Visual Studio I can read the clock cycle count from the processor as shown below. How do I do the same thing with GCC?

#ifdef _MSC_VER             // Compiler: Microsoft Visual Studio

    #ifdef _M_IX86                      // Processor: x86

        inline uint64_t clockCycleCount()
        {
            uint64_t c;
            __asm {
                cpuid       // serialize processor
                rdtsc       // read time stamp counter
                mov dword ptr [c + 0], eax
                mov dword ptr [c + 4], edx
            }
            return c;
        }

    #elif defined(_M_X64)               // Processor: x64

        extern "C" unsigned __int64 __rdtsc();
        #pragma intrinsic(__rdtsc)
        inline uint64_t clockCycleCount()
        {
            return __rdtsc();
        }

    #endif

#endif
like image 520
Johan Råde Avatar asked Mar 27 '12 10:03

Johan Råde


People also ask

Does RDTSC count the number of core clock cycles?

Without this, RDTSC does count core clock cycles. nonstop_tsc: This feature is called the invariant TSC in the Intel SDM manual and is supported on processors with CPUID.80000007H:EDX [8]. The TSC keeps ticking even in deep sleep C-states.

What is RDTSC performance timer in C++?

The RDTSC Performance Timer written in C++. The RDTSC is the IA-32/IA-64 (or x86/x64) instruction that loads the current value of the processors’ time stamp into EDX:EAX registers. RDTSC is short for “Read Time-Stamp Counter”. It returns the number of clock cycles since last reset.

Does RDTSC count as a performance counter?

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (not counting system clock adjustments, so it's a perfect time source for steady_clock ).

What is the RDTSC instruction used for?

Generates the rdtsc instruction, which returns the processor time stamp. The processor time stamp records the number of clock cycles since the last reset. A 64-bit unsigned integer representing a tick count. This routine is available only as an intrinsic.


4 Answers

The other answers work, but you can avoid inline assembly by using GCC's __rdtsc intrinsic, available by including x86intrin.h.

It is defined at: gcc/config/i386/ia32intrin.h:

/* rdtsc */
extern __inline unsigned long long
__attribute__((__gnu_inline__, __always_inline__, __artificial__))
__rdtsc (void)
{
  return __builtin_ia32_rdtsc ();
}
like image 100
Evan Shaw Avatar answered Oct 09 '22 23:10

Evan Shaw


On recent versions of Linux gettimeofday will incorporate nanosecond timings.

If you really want to call RDTSC you can use the following inline assembly:

http://www.mcs.anl.gov/~kazutomo/rdtsc.html

#if defined(__i386__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
    return x;
}

#elif defined(__x86_64__)

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#endif
like image 35
Andrew Tomazos Avatar answered Oct 10 '22 00:10

Andrew Tomazos


Update: reposted and updated this answer on a more canonical question. I'll probably delete this at some point once we sort out which question to use as the duplicate target for closing all the similar rdtsc questions.


You don't need and shouldn't use inline asm for this. There's no benefit; compilers have built-ins for rdtsc and rdtscp, and (at least these days) all define a __rdtsc intrinsic if you include the right headers. https://gcc.gnu.org/wiki/DontUseInlineAsm

Unfortunately MSVC disagrees with everyone else about which header to use for non-SIMD intrinsics. (Intel's intriniscs guide says #include <immintrin.h> for this, but with gcc and clang the non-SIMD intrinsics are mostly in x86intrin.h.)

#ifdef _MSC_VER
#include <intrin.h>
#else
#include <x86intrin.h>
#endif

// optional wrapper if you don't want to just use __rdtsc() everywhere
inline
unsigned long long readTSC() {
    // _mm_lfence();  // optionally wait for earlier insns to retire before reading the clock
    return __rdtsc();
    // _mm_lfence();  // optionally block later instructions until rdtsc retires
}

Compiles with all 4 of the major compilers: gcc/clang/ICC/MSVC, for 32 or 64-bit. See the results on the Godbolt compiler explorer.

For more about using lfence to improve repeatability of rdtsc, see @HadiBrais' answer on clflush to invalidate cache line via C function.

See also Is LFENCE serializing on AMD processors? (TL:DR yes with Spectre mitigation enabled, otherwise kernels leave the relevant MSR unset.)


rdtsc counts reference cycles, not CPU core clock cycles

It counts at a fixed frequency regardless of turbo / power-saving, so if you want uops-per-clock analysis, use performance counters. rdtsc is exactly correlated with wall-clock time (except for system clock adjustments, so it's basically steady_clock). It ticks at the CPU's rated frequency, i.e. the advertised sticker frequency.

If you use it for microbenchmarking, include a warm-up period first to make sure your CPU is already at max clock speed before you start timing. Or better, use a library that gives you access to hardware performance counters, or a trick like perf stat for part of program if your timed region is long enough that you can attach a perf stat -p PID. You usually will still want to avoid CPU frequency shifts during your microbenchmark, though.

  • std::chrono::clock, hardware clock and cycle count
  • Getting cpu cycles using RDTSC - why does the value of RDTSC always increase?
  • Lost Cycles on Intel? An inconsistency between rdtsc and CPU_CLK_UNHALTED.REF_TSC

It's also not guaranteed that the TSCs of all cores are in sync. So if your thread migrates to another CPU core between __rdtsc(), there can be an extra skew. (Most OSes attempt to sync the TSCs of all cores, though.) If you're using rdtsc directly, you probably want to pin your program or thread to a core, e.g. with taskset -c 0 ./myprogram on Linux.


How good is the asm from using the intrinsic?

It's at least as good as anything you could do with inline asm.

A non-inline version of it compiles MSVC for x86-64 like this:

unsigned __int64 readTSC(void) PROC                             ; readTSC
    rdtsc
    shl     rdx, 32                             ; 00000020H
    or      rax, rdx
    ret     0
  ; return in RAX

For 32-bit calling conventions that return 64-bit integers in edx:eax, it's just rdtsc/ret. Not that it matters, you always want this to inline.

In a test caller that uses it twice and subtracts to time an interval:

uint64_t time_something() {
    uint64_t start = readTSC();
    // even when empty, back-to-back __rdtsc() don't optimize away
    return readTSC() - start;
}

All 4 compilers make pretty similar code. This is GCC's 32-bit output:

# gcc8.2 -O3 -m32
time_something():
    push    ebx               # save a call-preserved reg: 32-bit only has 3 scratch regs
    rdtsc
    mov     ecx, eax
    mov     ebx, edx          # start in ebx:ecx
      # timed region (empty)

    rdtsc
    sub     eax, ecx
    sbb     edx, ebx          # edx:eax -= ebx:ecx

    pop     ebx
    ret                       # return value in edx:eax

This is MSVC's x86-64 output (with name-demangling applied). gcc/clang/ICC all emit identical code.

# MSVC 19  2017  -Ox
unsigned __int64 time_something(void) PROC                            ; time_something
    rdtsc
    shl     rdx, 32                  ; high <<= 32
    or      rax, rdx
    mov     rcx, rax                 ; missed optimization: lea rcx, [rdx+rax]
                                     ; rcx = start
     ;; timed region (empty)

    rdtsc
    shl     rdx, 32
    or      rax, rdx                 ; rax = end

    sub     rax, rcx                 ; end -= start
    ret     0
unsigned __int64 time_something(void) ENDP                            ; time_something

All 4 compilers use or+mov instead of lea to combine the low and high halves into a different register. I guess it's kind of a canned sequence that they fail to optimize.

But writing it in inline asm yourself is hardly better. You'd deprive the compiler of the opportunity to ignore the high 32 bits of the result in EDX, if you're timing such a short interval that you only keep a 32-bit result. Or if the compiler decides to store the start time to memory, it could just use two 32-bit stores instead of shift/or / mov. If 1 extra uop as part of your timing bothers you, you'd better write your whole microbenchmark in pure asm.

like image 43
Peter Cordes Avatar answered Oct 09 '22 23:10

Peter Cordes


On Linux with gcc, I use the following:

/* define this somewhere */
#ifdef __i386
__inline__ uint64_t rdtsc() {
  uint64_t x;
  __asm__ volatile ("rdtsc" : "=A" (x));
  return x;
}
#elif __amd64
__inline__ uint64_t rdtsc() {
  uint64_t a, d;
  __asm__ volatile ("rdtsc" : "=a" (a), "=d" (d));
  return (d<<32) | a;
}
#endif

/* now, in your function, do the following */
uint64_t t;
t = rdtsc();
// ... the stuff that you want to time ...
t = rdtsc() - t;
// t now contains the number of cycles elapsed
like image 27
a3nm Avatar answered Oct 10 '22 00:10

a3nm